diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 0000000..e26f8e5 --- /dev/null +++ b/docs/README.md @@ -0,0 +1,69 @@ +# Embedding & Vector Search System + +AI-powered embedding and vector search system that processes images, PDFs, and text to generate embeddings, stores them in vector databases, and provides semantic search via a Flask API and React frontend. Includes GPT-4 powered document analysis for resume/job description matching and M&A document processing. + +## Architecture + +``` + ┌──────────────────────┐ + │ React Frontend │ + │ (Mantine + Vite) │ + └──────┬───────┬────────┘ + │ │ + REST API │ │ GraphQL + (Axios) │ │ (Apollo Client) + │ │ + ┌──────────────▼─┐ ┌─▼──────────────┐ + │ Flask API │ │ Weaviate │ + │ (clip_app.py) │ │ Vector DB │ + │ Port 5000 │ │ Port 8080 │ + └──────┬─────────┘ └────────────────┘ + │ + ┌─────────▼─────────┐ + │ CLIP Model │ + │ (ViT-L/14) │ + └───────────────────┘ + + ┌─────────────────┐ ┌──────────────────┐ ┌──────────────────┐ + │ embedding.py │ │ extract_pdf.py │ │ process.py │ + │ CLIP/MediaPipe │ │ PyMuPDF │ │ GPT-4 Analysis │ + │ → SQLite │ │ Text + Images │ │ Resume/JD/M&A │ + └─────────────────┘ └──────────────────┘ └──────────────────┘ + + ┌─────────────────┐ ┌──────────────────┐ + │ clickhouse.py │ │ use_weaviate.py │ + │ Vector Storage │ │ Vector Storage │ + └─────────────────┘ └──────────────────┘ +``` + +## Data Flow + +1. **Ingestion** -- Images and PDFs are processed by `embedding.py` or `extract_pdf.py` +2. **Embedding** -- CLIP (ViT-L/14) or MediaPipe generates vector embeddings +3. **Storage** -- Embeddings are stored in SQLite, ClickHouse, or Weaviate +4. **Search** -- The React frontend sends text queries to the Flask API, which generates a CLIP text embedding and queries Weaviate for nearest neighbors +5. 
**Display** -- Results are shown as image cards with similarity distances, album views, and "find similar" functionality + +## Components + +| Component | Description | Doc | +|-----------|-------------|-----| +| Flask API | REST API for embedding text and serving images | [backend-api.md](backend-api.md) | +| Embedding Generation | CLIP and MediaPipe embedding CLI | [embedding.md](embedding.md) | +| PDF Extraction | Text and image extraction from PDFs | [pdf-extraction.md](pdf-extraction.md) | +| Document Processing | GPT-4 document analysis (resume/JD matching) | [document-processing.md](document-processing.md) | +| Pose Detection | MediaPipe pose landmark detection | [pose-detection.md](pose-detection.md) | +| Vector Databases | ClickHouse, Weaviate, and SQLite integrations | [vector-databases.md](vector-databases.md) | +| React Frontend | Search UI with image gallery | [frontend.md](frontend.md) | +| Playground | Experimental multi-pass GPT processing | [playground.md](playground.md) | +| Setup | Installation and configuration | [setup.md](setup.md) | + +## Tech Stack + +**Backend:** Python, Flask, PyTorch, CLIP (ViT-L/14), MediaPipe, PyMuPDF, OpenAI API, Pydantic, SQLAlchemy, pandas + +**Frontend:** React 18, TypeScript, Vite, Mantine UI, Apollo Client, Axios, weaviate-ts-client + +**Databases:** Weaviate (vector search + GraphQL), ClickHouse (vector storage), SQLite (local embeddings), PostgreSQL (playground) + +**Infrastructure:** Docker (multi-stage Nginx build for frontend), CUDA auto-detection for GPU acceleration diff --git a/docs/backend-api.md b/docs/backend-api.md new file mode 100644 index 0000000..a87ad27 --- /dev/null +++ b/docs/backend-api.md @@ -0,0 +1,98 @@ +# Backend API (`clip_app.py`) + +Flask REST API that provides CLIP text embedding and image serving endpoints. Runs on port 5000 with CORS enabled. + +## Startup + +Loads the CLIP ViT-L/14 model on startup. Automatically uses CUDA if available, otherwise falls back to CPU. 
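A minimal sketch of that device auto-detection (an assumed equivalent, not the actual `clip_app.py` source):

```python
# Guarded device selection: prefer CUDA when PyTorch reports it, else CPU.
try:
    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:  # torch not installed: CPU-only environment
    device = "cpu"

# clip_app.py can then load the model on the chosen device, e.g.:
# model, preprocess = clip.load("ViT-L/14", device=device)
print(device)
```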
+ +```bash +python backend/clip_app.py +``` + +The server runs on `0.0.0.0:5000` in debug mode. + +## Endpoints + +### `POST /embed` + +Generates a CLIP text embedding. + +**Request body:** +```json +{ + "text": "a sleepy ridgeback dog" +} +``` + +**Response:** +```json +{ + "embedding": [0.123, -0.456, ...] +} +``` + +The embedding is a 768-dimensional float array (CLIP ViT-L/14 output). + +**Errors:** +- `400` -- No `text` field provided in request body + +### `GET /image` + +Returns a full-size image file. + +**Query parameters:** +- `fileName` (required) -- Image filename. The date folder is extracted from the first 8 characters (YYYYMMDD format). + +**Response:** JPEG image file + +**Path resolution:** `{MOUNT}/{YYYYMMDD}/{fileName}` + +**Errors:** +- `400` -- No `fileName` provided +- `404` -- File not found at resolved path + +### `GET /thumbnail` + +Returns a resized thumbnail (max 600x600 pixels). + +**Query parameters:** +- `fileName` (required) -- Same format as `/image` + +**Response:** JPEG image (resized to fit within 600x600 while maintaining aspect ratio) + +Thumbnails are generated on-the-fly using PIL. Not cached. + +**Errors:** +- `400` -- No `fileName` provided +- `404` -- File not found + +### `GET /list_files_by_date` + +Lists all files in a date-based directory. + +**Query parameters:** +- `fileName` (required) -- Any string with at least 8 characters. The first 8 characters are used as the date folder name (YYYYMMDD). + +**Response:** +```json +{ + "files": ["20230101_001.jpg", "20230101_002.jpg"] +} +``` + +**Errors:** +- `400` -- Invalid or missing `fileName` (less than 8 characters) +- `404` -- Directory not found + +## Environment Variables + +| Variable | Description | Default | +|----------|-------------|---------| +| `MOUNT` | Base path for image files | `""` (empty string) | + +## Integration + +The frontend (`EmbeddingSearch.tsx`) uses this API in two ways: +1. 
`POST /embed` to convert search text into a vector, then queries Weaviate directly with the vector +2. `GET /image`, `GET /thumbnail`, and `GET /list_files_by_date` to display results diff --git a/docs/document-processing.md b/docs/document-processing.md new file mode 100644 index 0000000..74fd1e8 --- /dev/null +++ b/docs/document-processing.md @@ -0,0 +1,74 @@ +# Document Processing (`process.py`) + +GPT-4 powered document analysis tool. Supports two modes: single-file resume/job description matching, and batch directory processing with concurrent API calls. + +## Usage + +```bash +# Single file: resume against a job description +python backend/process.py --txtfile resume.txt --prompt resume_prompt.txt --job_description posting.txt + +# Batch: process all .txt files in a directory +python backend/process.py --dir ../outfolder --prompt resume_prompt.txt --job_description posting.txt +``` + +## CLI Options + +| Flag | Default | Description | +|------|---------|-------------| +| `--txtfile` | -- | Path to a single text file (resume) | +| `--dir` | `../outfolder` | Directory of text files to process | +| `--prompt` | `resume_prompt.txt` | Prompt template file | +| `--job_description` | `posting.txt` | Job description text file | +| `--csv` | `results.csv` | Output CSV file path | + +Either `--txtfile` or `--dir` must be provided. + +## Pipeline + +### Single File Mode (`--txtfile`) + +1. Reads the resume text file, prompt template, and job description +2. Substitutes `[resume_text]` and `[job_description]` placeholders in the prompt +3. Sends the assembled prompt to GPT-4 (`gpt-4-1106-preview`) with JSON response format +4. Saves the response as `{txtfile}.json` + +### Batch Directory Mode (`--dir`) + +1. Scans the directory for all `.txt` files +2. Processes each file using `ThreadPoolExecutor` with 20 workers +3. Each file goes through the single-file pipeline concurrently + +### Multi-Page Mode (internal) + +The `process_file_and_prompt_multi_pages` function: +1. 
Reads a text file and splits it at a page-separator marker, yielding one chunk per page. Newlines within each chunk are preserved as-is.

### Token Injection Mode (`--inject_tokens`)

Text is processed with additional normalization and tagging:

1. **Whitespace normalization** -- Newlines are replaced with spaces, and runs of multiple spaces are collapsed
2. **Segment extraction** -- Text is split at Roman numeral markers `(I)`, `(II)`, etc. and alphabetical markers `(a)`, `(b)`, etc.
3. **Sentence normalization** -- Segments are adjusted to fall between 120 and 360 characters:
   - Short segments (< 120 chars) are merged with the next segment
   - Long segments (> 360 chars) are split at the last comma or period before the 360-char limit
4. **Tag injection** -- Each normalized segment is wrapped in an injected tag

### Constants

| Constant | Value | Description |
|----------|-------|-------------|
| `MIN_CHARS` | 120 | Minimum characters for a text segment |
| `MAX_CHARS` | 360 | Maximum characters for a text segment |

## Image Extraction

Images are extracted from each page using PyMuPDF's `get_images(full=True)`. Each image is saved as a JPEG file named `{page_number}-{image_index}.jpg`.

diff --git a/docs/playground.md b/docs/playground.md new file mode 100644 index 0000000..26452b0 --- /dev/null +++ b/docs/playground.md @@ -0,0 +1,107 @@

# Playground (Experimental Scripts)

The `playground/` directory contains experimental scripts for multi-pass GPT document processing with PostgreSQL and Weaviate storage. These scripts implement a more advanced pipeline than the main `backend/process.py`, with multi-step analysis, PostgreSQL persistence, and structured data models.

## `playground/process.py`

Multi-pass GPT-4 document processing for financial/legal document analysis (e.g., credit agreements, lending documents).
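The pipeline's concurrent fan-out can be sketched as follows; the GPT-4 call is stubbed out as `extract_page_data`, and all names here are illustrative rather than taken from the script:

```python
# Step 1 fan-out: process pages concurrently, collecting structured results.
from concurrent.futures import ThreadPoolExecutor

def extract_page_data(page_text: str) -> dict:
    # Stand-in for the real GPT-4 request returning structured data per page.
    return {"chars": len(page_text)}

pages = ["first page text", "second page text"]
with ThreadPoolExecutor(max_workers=20) as pool:
    step1_results = list(pool.map(extract_page_data, pages))
# A second pass would then aggregate or refine the per-page results.
print(step1_results)  # → [{'chars': 15}, {'chars': 16}]
```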
### Architecture

Two-step pipeline with concurrent processing:

**Step 1:** Process document pages through GPT-4 to extract structured data
- Reads text files split by page markers

## `playground/postgres.py`

PostgreSQL persistence utilities for the playground pipeline (schema: `experiments`).

**`experiments.in_out_files`**

| Column | Type | Description |
|--------|------|-------------|
| `id` | SERIAL PRIMARY KEY | File ID |
| `document_id` | INTEGER (FK) | Reference to documents table |
| `file_type` | VARCHAR(50) | `in` or `out` |
| `file_content` | TEXT | Prompt input or GPT response |

### Functions

- `init_schema()` -- Creates schema and tables
- `process_and_insert_txt_to_db(filename)` -- Reads a tagged text file, extracts tags, stores them in `documents_tags`
- `insert_in_out_table(prompt_path, document_id)` -- Imports `.in.txt` and `.out.json` files
- `insert_metrics(id)` -- Imports `metrics.csv` into the database
- `insert_pass1_results(id)` / `insert_pass2_results(id)` -- Imports processing results

## `playground/df_psql.py`

Shared PostgreSQL utilities. Provides:
- SQLAlchemy engine setup from the `POSTGRES_DB_CREDENTIALS` environment variable
- Helper functions for reading/writing DataFrames to PostgreSQL

diff --git a/docs/pose-detection.md b/docs/pose-detection.md new file mode 100644 index 0000000..1a57492 --- /dev/null +++ b/docs/pose-detection.md @@ -0,0 +1,68 @@

# Pose Detection (`detect_pose_mediapipe.py`)

MediaPipe-based pose landmark detection tool. Detects up to 10 poses per image, generates annotated images, and saves landmark data as JSON.

## Usage

```bash
# Single image
python backend/detect_pose_mediapipe.py --image path/to/image.jpg

# Batch processing with glob pattern
python backend/detect_pose_mediapipe.py --pattern "./*.jpg"
```

## CLI Options

| Flag | Description |
|------|-------------|
| `--image` | Path to a single image file |
| `--pattern` | Glob pattern for batch processing (e.g., `./*.jpg`) |

One of `--image` or `--pattern` is required.
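The required-but-mutually-exclusive flags can be expressed with `argparse`; this is a sketch of the behavior described above, not the script's actual parser:

```python
# Exactly one of --image / --pattern must be given.
import argparse

parser = argparse.ArgumentParser()
group = parser.add_mutually_exclusive_group(required=True)
group.add_argument("--image", help="path to a single image file")
group.add_argument("--pattern", help="glob pattern for batch processing")

args = parser.parse_args(["--pattern", "./*.jpg"])
print(args.pattern)  # → ./*.jpg
```

Passing both flags, or neither, makes `parse_args` exit with a usage error.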
+ +## Model Configuration + +- Model: `pose_landmarker_lite.task` (must be present in the working directory) +- Running mode: `IMAGE` (single image, not video/live stream) +- Segmentation masks: enabled +- Maximum poses per image: 10 + +## Output + +For each image where landmarks are detected: + +### Annotated Image + +Saved as `{original_name}.pose.jpg`. The original image is overlaid with pose landmark connections drawn using MediaPipe's default pose landmark style. + +### JSON Landmarks + +Saved as `{original_name}.json`. Contains all detected landmark coordinates: + +```json +{ + "landmarks": [ + { + "x": 0.543, + "y": 0.312, + "z": -0.05, + "visibility": 0.98 + } + ] +} +``` + +Each pose has 33 landmarks (MediaPipe Pose standard). Multiple poses are flattened into a single `landmarks` array. + +- `x`, `y` -- Normalized coordinates (0.0 to 1.0 relative to image dimensions) +- `z` -- Depth estimate relative to the hip midpoint +- `visibility` -- Confidence that the landmark is visible (0.0 to 1.0) + +## Batch Processing + +When using `--pattern`, files are: +1. Matched with `glob.glob` +2. Filtered to exclude files containing `.mask.` or `.pose.` in the filename (avoids reprocessing outputs) +3. Sorted alphabetically +4. 
Processed sequentially diff --git a/docs/setup.md b/docs/setup.md new file mode 100644 index 0000000..99f50d1 --- /dev/null +++ b/docs/setup.md @@ -0,0 +1,135 @@ +# Setup & Configuration + +## Python Dependencies + +```bash +pip install -r requirements.txt +``` + +Core dependencies from `requirements.txt`: +- `pandas==2.1.3` +- `sqlalchemy==2.0.27` +- `python-dotenv==1.0.0` +- `openai==1.9.0` +- `psycopg2-binary==2.9.9` +- `pydantic==2.5.3` +- `more-itertools==10.2.0` +- `PyMuPDF==1.23.26` + +Additional dependencies not listed in `requirements.txt` (install separately): +- `torch` + `clip` (for CLIP embeddings) +- `mediapipe` (for MediaPipe embeddings and pose detection) +- `flask` + `flask-cors` (for the API server) +- `clickhouse-driver` (for ClickHouse integration) +- `weaviate-client` (for Weaviate integration) +- `numpy`, `Pillow`, `opencv-python` (for image processing) +- `scikit-learn`, `plotly` (for visualization) +- `transformers` (for classify.py) + +## Frontend Dependencies + +```bash +cd frontend +npm install +``` + +See [frontend.md](frontend.md) for the full dependency list. + +## Environment Variables + +### Backend (`backend/.env`) + +Copy `backend/.env.example` to `backend/.env` and configure: + +| Variable | Example | Used By | +|----------|---------|---------| +| `CH_HOST` | `localhost` | `clickhouse.py` | +| `CH_PORT` | `9000` | `clickhouse.py` | +| `CH_USER` | `default` | `clickhouse.py` | +| `CH_PASSWORD` | `foobar` | `clickhouse.py` | +| `CH_DATABASE` | `db` | `clickhouse.py` | +| `MOUNT` | `/mnt/data` | `clip_app.py` | +| `OPENAI_API_KEY` | `sk-...` | `process.py`, playground scripts | + +### Frontend (`frontend/.env`) + +Copy `frontend/.env.example` to `frontend/.env`: + +| Variable | Example | Description | +|----------|---------|-------------| +| `BACKEND` | `0.0.0.0` | Backend host address | + +Note: The frontend currently hardcodes `"localhost"` in `src/env.ts` rather than reading from the environment variable. 
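On the backend side, these settings are read from the environment (`python-dotenv` is listed in `requirements.txt`); a minimal sketch using the documented defaults:

```python
# Load backend/.env if python-dotenv is available, then read settings with
# the defaults documented above (MOUNT falls back to an empty string).
import os

try:
    from dotenv import load_dotenv
    load_dotenv("backend/.env")  # no-op when the file is absent
except ImportError:
    pass  # rely on the process environment instead

MOUNT = os.environ.get("MOUNT", "")
CH_HOST = os.environ.get("CH_HOST", "localhost")
```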
+ +### Playground + +The playground scripts require additional environment variables: + +| Variable | Format | Used By | +|----------|--------|---------| +| `POSTGRES_DB_CREDENTIALS` | PostgreSQL connection string | `playground/process.py`, `playground/df_psql.py` | +| `MODEL_NAME` | e.g., `gpt-4-turbo-preview` | `playground/process.py` | +| `WEAVIATE_CREDENTIALS` | JSON: `{"URL":"...","API_KEY":"...","HTTP_SCHEME":"http"}` | `playground/postgres.py` | +| `POSTGRES_CREDENTIALS` | JSON: `{"NAME":"...","USER":"...","PASSWORD":"...","HOST":"...","PORT":"5432"}` | `playground/postgres.py` | + +## Database Setup + +### Weaviate + +Weaviate must be running and accessible. Default expected at `localhost:8080`. + +The `FileEmbedding` class is created automatically by the Weaviate client when storing embeddings. No manual schema setup is required. + +### ClickHouse + +Create the vector storage table: + +```sql +CREATE TABLE vector_storage ( + embedding_vector Array(UInt8), + md5_hash FixedString(32), + file_name String +) ENGINE = MergeTree() +ORDER BY md5_hash; +``` + +### PostgreSQL (Playground) + +Initialize the schema using the playground utility: + +```python +from playground.postgres import init_schema +init_schema() # Creates experiments schema and tables +``` + +This creates: +- `experiments.documents` -- Document metadata +- `experiments.documents_tags` -- Tagged text segments +- `experiments.in_out_files` -- Prompt inputs/outputs + +The `playground/process.py` pipeline also creates additional tables: +- `experiments.prompts` -- Stored prompt templates +- `experiments.promptios` -- Prompt I/O logs with metrics +- `experiments.pass1_results` -- Step 1 processing results +- `experiments.pass2_results` -- Step 2 processing results + +### SQLite + +Created automatically by `embedding.py`. Default file: `embedding.sqlite`. No manual setup needed. + +## Docker Deployment (Frontend) + +```bash +cd frontend +docker build -t embedding-frontend . 
docker run -p 8080:80 embedding-frontend
```

This builds the React app and serves it via Nginx on port 80 (mapped to host port 8080). Note that host port 8080 is also the default Weaviate port; when both run on the same machine, map the frontend to a different host port (e.g., `-p 3000:80`).

## Running the System

1. Start Weaviate (port 8080)
2. Start the Flask API: `python backend/clip_app.py` (port 5000)
3. Start the frontend: `cd frontend && npm run dev`
4. Open the frontend in a browser and enter text queries to search for similar images

diff --git a/docs/vector-databases.md b/docs/vector-databases.md new file mode 100644 index 0000000..cf12344 --- /dev/null +++ b/docs/vector-databases.md @@ -0,0 +1,90 @@

# Vector Database Integrations

The system supports three vector storage backends: ClickHouse, Weaviate, and SQLite.

## ClickHouse (`clickhouse.py`)

Stores embedding vectors with file metadata using the `clickhouse-driver` Python client.

### Schema

```sql
CREATE TABLE vector_storage (
    embedding_vector Array(UInt8),
    md5_hash FixedString(32),
    file_name String
) ENGINE = MergeTree()
ORDER BY md5_hash;
```

### Operations

- `insert_into_db(client, embedding_vector, file_name)` -- Inserts an embedding vector, auto-computes MD5 hash from the file
- `get_md5_hash(file_name)` -- Generates MD5 hash for a file

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `CH_HOST` | `localhost` | ClickHouse server host |
| `CH_PORT` | `9000` | ClickHouse native protocol port |
| `CH_USER` | `default` | Database user |
| `CH_PASSWORD` | `""` | Database password |
| `CH_DATABASE` | `default` | Database name |

### Verification

```bash
clickhouse-client --user <user> --password <password> --database <database>
# then, at the clickhouse-client prompt:
SELECT * FROM vector_storage;
```

## Weaviate (`use_weaviate.py`)

Full-featured vector database integration using the `WeaviateEmbeddingStore` class.

### Class: `WeaviateEmbeddingStore`

Manages a `FileEmbedding` class in Weaviate.
+ +**Schema properties:** +- `fileName` (string) -- Name of the source file +- `md5Hash` (string) -- MD5 hash of the file +- `fileSize` (int) -- File size in bytes +- `modelName` (string) -- Name of the embedding model used + +**Methods:** + +| Method | Description | +|--------|-------------| +| `store_embedding(file_name, md5_hash, file_size, model_name, embedding)` | Store an embedding with metadata | +| `get_nearest_neighbors(embedding, max_results=5)` | Find nearest neighbors by vector similarity | +| `check_md5_exists(md5_hash, model_name)` | Check if an MD5/model combination exists | +| `check_multiple_md5_exists(md5_hashes, model_name)` | Batch existence check | +| `delete_embedding(md5_hash)` | Delete embeddings by MD5 hash | +| `delete_all_embeddings()` | Delete the entire `FileEmbedding` class | + +### Querying + +Weaviate supports both the Python client API and raw GraphQL: + +```graphql +{ + Get { + FileEmbedding { + fileName + md5Hash + fileSize + modelName + } + } +} +``` + +The frontend queries Weaviate directly via GraphQL through the Apollo Client and the `weaviate-ts-client`. + +## SQLite (via `embedding.py`) + +Local storage for embedding results. See [embedding.md](embedding.md) for the schema and usage details. + +The SQLite database (`embedding.sqlite` by default) stores embeddings as text-serialized float arrays alongside file metadata (filename, size, MD5 hash, computation time, method).
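The text-serialized storage can be illustrated with a small self-contained example; JSON is used here as an assumed serialization format (see [embedding.md](embedding.md) for the actual schema):

```python
# Round-trip a float vector through a TEXT column in SQLite.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE embeddings (file_name TEXT, md5_hash TEXT, embedding TEXT)"
)
vector = [0.123, -0.456, 0.789]
conn.execute(
    "INSERT INTO embeddings VALUES (?, ?, ?)",
    ("20230101_001.jpg", "0" * 32, json.dumps(vector)),
)
(serialized,) = conn.execute("SELECT embedding FROM embeddings").fetchone()
print(json.loads(serialized))  # → [0.123, -0.456, 0.789]
```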