VEIL - Encrypted Document Intelligence

Search millions of sensitive documents without exposing your sources.

Live Demo • Features • Architecture • Quick Start

Screenshots

Document Upload & Processing

Chunk Viewer with Citations

The Problem: Investigative Journalism Needs Secure Document Analysis

Investigative journalists often work with highly sensitive leaked documents like court filings, financial records, whistleblower evidence, and classified materials. Traditional document analysis tools store your data in plaintext, creating massive security risks:

Source Exposure: Unencrypted databases can be subpoenaed or hacked
Legal Liability: Storing sensitive documents in the clear creates liability
Chilling Effect: Journalists self-censor knowing their data isn't protected

VEIL solves this with end-to-end encrypted vector search powered by CyborgDB.

Why Encryption Matters for Journalism

Traditional RAG Systems

Documents are converted to plaintext chunks and stored in a vector database. Anyone with database access can read your sources.

VEIL with CyborgDB

Documents are converted to encrypted chunks and stored as encrypted vectors. Even with database access, data is unreadable without keys.

CyborgDB's Encryption Guarantees

Security Feature	Description
Client-side Encryption	Vectors are encrypted BEFORE leaving your machine
Zero-Knowledge Architecture	CyborgDB can never read your data
AES-256 Encryption	Military-grade encryption for vector indices
Per-Session Keys	Each investigation session uses unique encryption keys
Encrypted Search	Search queries are performed on encrypted data

"With VEIL, journalists can analyze thousands of leaked documents with the same security guarantees as Signal messaging."

Features

Multi-Format Document Support

PDFs with page-level citations (OCR-enabled)
Word Documents (.docx)
Spreadsheets (.xlsx, .csv)
Presentations (.pptx)
Images with vision AI extraction
Plain text files

Intelligent RAG Pipeline

Semantic chunking with overlap for context preservation
Dynamic TopK calculation based on document count
Multi-document parallel search across entire case files
Page-level citation tracking for PDF documents
Ranked context retrieval with relevance scoring

Real-Time Chat Interface

Streaming responses from the LLM
Session-based conversations with full history
Inline citations linking directly to source pages
Chunk viewer for examining exact source text

Processing Transparency

Live engine events showing every processing step
Upload progress tracking for large documents
Vector storage confirmations with index names
Search result previews with chunk counts

Technical Architecture

flowchart TB
    subgraph Frontend["Frontend - Next.js 15 + React 19 + TailwindCSS"]
        SM[Session Manager]
        CU[Chat UI with Streaming]
        FP[File Panel]
        CV[Chunk Viewer with Citations]
    end

    subgraph Backend["Backend - Express 5 + TypeScript + Prisma"]
        subgraph RAG["RAG Orchestration Layer"]
            IS["Ingestion Service<br/>PDF→MD, Chunking, Embedding"]
            RS["Retrieval Service<br/>Dynamic K, Multi-doc, Ranking"]
            GS["Generation Service<br/>LLM, Streaming"]
        end
        
        subgraph Cyborg["CyborgDB Layer - Encrypted"]
            IDX["Index Service<br/>IVFFlat, 768-dim, Per-session"]
            STO["Storage Service<br/>Encrypted Upsert, Batch Ops"]
            SCH["Search Service<br/>Encrypted Query, Top-K"]
        end
    end

    subgraph Python["Python Service - FastAPI + MarkItDown"]
        PP[PDF Parsing with OCR]
        PM[Page Marker Injection]
        MF[Multi-format Support]
        GV[Multimodal LLM for Images]
    end

    subgraph Ollama["Ollama - Local Embeddings"]
        NE["nomic-embed-text<br/>768 dimensions<br/>100% Local"]
    end

    subgraph Databases["Databases"]
        PG[(PostgreSQL<br/>Metadata)]
        CDB[(CyborgDB<br/>Encrypted Vectors)]
    end

    Frontend -->|SSE + REST API| Backend
    IS --> Python
    IS --> Ollama
    RAG --> Cyborg
    Cyborg --> Databases

Data Flow

sequenceDiagram
    participant U as User
    participant F as Frontend
    participant B as Backend
    participant P as Python Service
    participant O as Ollama
    participant C as CyborgDB
    participant G as LLM

    U->>F: Upload Document
    F->>B: POST /upload
    B->>P: Convert to Markdown
    P-->>B: Markdown with Page Markers
    B->>B: Chunk Document
    B->>O: Generate Embeddings
    O-->>B: 768-dim Vectors
    B->>C: Store Encrypted Vectors
    C-->>B: Confirmation
    
    U->>F: Ask Question
    F->>B: POST /chat
    B->>O: Embed Query
    O-->>B: Query Vector
    B->>C: Encrypted Vector Search
    C-->>B: Top-K Results
    B->>G: Generate Response with Context
    G-->>B: Streaming Response
    B-->>F: SSE Stream
    F-->>U: Display with Citations

RAG Techniques Used

1. Semantic Chunking with Overlap

Documents are split into 2000-character chunks with 200-character overlap to preserve context across boundaries. The chunking algorithm uses priority-based separators (paragraphs, lines, sentences, words) to find natural break points.
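The chunking strategy described above can be sketched roughly as follows. This is a minimal illustration, not VEIL's actual implementation: the function names `findBreak` and `chunkText`, the exact separator strings, and the rule of ignoring break points in the first half of a window are all assumptions made for the example.

```typescript
// Separators in priority order: paragraph, line, sentence, word (assumed strings).
const SEPARATORS = ["\n\n", "\n", ". ", " "];

// Hypothetical helper: pick the end of a chunk, preferring a natural break point.
function findBreak(text: string, start: number, hardEnd: number): number {
  if (hardEnd >= text.length) return text.length;
  const window = text.slice(start, hardEnd);
  for (const sep of SEPARATORS) {
    const idx = window.lastIndexOf(sep);
    // Ignore breaks in the first half of the window so chunks do not become tiny.
    if (idx > window.length / 2) return start + idx + sep.length;
  }
  return hardEnd; // no natural break found: cut at the size limit
}

// Split text into chunks of at most chunkSize characters, each overlapping
// the previous chunk by `overlap` characters.
function chunkText(text: string, chunkSize = 2000, overlap = 200): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    const hardEnd = Math.min(start + chunkSize, text.length);
    const end = findBreak(text, start, hardEnd);
    chunks.push(text.slice(start, end));
    if (end >= text.length) break;
    // Step back by the overlap, but always move forward by at least one character.
    start = Math.max(end - overlap, start + 1);
  }
  return chunks;
}
```

Trying separators in priority order means a chunk ends at a paragraph boundary when one is available and falls back to a hard cut at the size limit only when the window contains no usable separator.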

2. Dynamic TopK Retrieval

The system automatically adjusts the number of chunks retrieved based on the document count to optimize context quality within token limits.

Documents	TopK Per Doc	Total Context
1	15	~7,500 tokens
5	10	~25,000 tokens
10+	8	~40,000 tokens
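As a rough sketch, the table above could be encoded as a small lookup function. Only the three rows shown are taken from the table; the behavior for document counts between those rows, the function names, and the figure of roughly 500 tokens per 2000-character chunk (about 4 characters per token) are assumptions made for illustration.

```typescript
// Thresholds mirror the three table rows; values between rows are an assumption.
function topKPerDocument(documentCount: number): number {
  if (documentCount <= 1) return 15;
  if (documentCount < 10) return 10;
  return 8;
}

// Assumed average: a 2000-character chunk at about 4 characters per token.
const APPROX_TOKENS_PER_CHUNK = 500;

// Estimate the total context size when every document contributes its full TopK.
function estimateContextTokens(documentCount: number): number {
  return documentCount * topKPerDocument(documentCount) * APPROX_TOKENS_PER_CHUNK;
}
```

Under these assumptions the estimates for 1, 5, and 10 documents come out at 7,500, 25,000, and 40,000 tokens, which matches the Total Context column.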

3. Parallel Multi-Document Search

When querying across multiple documents, VEIL performs concurrent vector searches against each document's chunks, then ranks and merges results by relevance score.
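The search-and-merge step might look something like the sketch below. It does not call the CyborgDB SDK: the per-document search is passed in as a hypothetical `searchOne` callback, the `ScoredChunk` shape is invented for the example, and a higher score is assumed to mean a more relevant chunk.

```typescript
// Hypothetical result shape; the real service may carry more fields.
interface ScoredChunk {
  documentId: string;
  chunkId: string;
  score: number; // assumed: higher means more relevant
}

// Hypothetical per-document search, standing in for the encrypted vector query.
type SearchFn = (
  documentId: string,
  queryVector: number[],
  topK: number
) => Promise<ScoredChunk[]>;

async function searchAcrossDocuments(
  documentIds: string[],
  queryVector: number[],
  topK: number,
  maxTotal: number,
  searchOne: SearchFn
): Promise<ScoredChunk[]> {
  // Start every per-document search at once and wait for all of them.
  const perDocument = await Promise.all(
    documentIds.map((id) => searchOne(id, queryVector, topK))
  );
  // Merge the result lists, rank by score, and keep the best maxTotal chunks.
  const merged = ([] as ScoredChunk[]).concat(...perDocument);
  return merged.sort((a, b) => b.score - a.score).slice(0, maxTotal);
}
```

`Promise.all` rejects as soon as any single search fails. If partial results are preferable to an error, `Promise.allSettled` is one alternative.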

4. Page-Level Citation Tracking

PDF page markers are extracted during parsing and preserved through the chunking process. The LLM injects citations in the format [SOURCE: document.pdf | Page 12] which link directly to the source in the chunk viewer.
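To link citations back to the chunk viewer, the generated answer has to be scanned for these markers. The sketch below shows one way to do that for the format quoted above. The `Citation` type and the `extractCitations` function are illustrative names, and the regular expression's tolerance for extra whitespace is an assumption.

```typescript
interface Citation {
  document: string;
  page: number;
}

// Extract every [SOURCE: <file> | Page <n>] marker from an LLM answer.
function extractCitations(answer: string): Citation[] {
  // A fresh regex per call avoids sharing lastIndex state between calls.
  const pattern = /\[SOURCE:\s*([^|\]]+?)\s*\|\s*Page\s+(\d+)\]/g;
  const citations: Citation[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(answer)) !== null) {
    citations.push({ document: match[1], page: Number(match[2]) });
  }
  return citations;
}
```

For example, the text `See [SOURCE: document.pdf | Page 12].` yields one citation with `document` set to `document.pdf` and `page` set to `12`.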

5. Normalized Vector Embeddings

All embeddings are L2-normalized before storage to ensure consistent cosine similarity scoring during retrieval.
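L2 normalization divides a vector by its Euclidean length, so every stored vector has length 1 and the dot product of two vectors equals their cosine similarity. A minimal sketch follows; the function names and the handling of the zero vector are assumptions.

```typescript
// Scale a vector to unit Euclidean length.
function l2Normalize(vector: number[]): number[] {
  const norm = Math.sqrt(vector.reduce((sum, x) => sum + x * x, 0));
  // A zero vector has no direction, so return an unchanged copy.
  if (norm === 0) return vector.slice();
  return vector.map((x) => x / norm);
}

// For unit-length vectors of equal dimension, the dot product is the cosine similarity.
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}
```

Normalizing once at storage time means similarity scores for different chunks are directly comparable, regardless of the magnitude of the raw embedding.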

Performance & Scale

Metric	Value
Token Budget	150,000 tokens per query
Max Chunks per Query	50 chunks
Max Chunks per Document	15 chunks
Embedding Dimension	768 (nomic-embed-text)
Embedding Batch Size	5 concurrent
Vector Upsert Batch	50 vectors per operation
Streaming Latency	<100ms first token
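The two batch limits in the table (5 concurrent embedding requests, 50 vectors per upsert) can be enforced with a small batching helper. The sketch below is an illustration only; `toBatches` and `mapInBatches` are hypothetical names, and the worker function stands in for whatever embedding or upsert call the backend actually makes.

```typescript
// Split a list into consecutive batches of at most batchSize items.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Run an async worker over all items, with at most batchSize calls in flight.
async function mapInBatches<T, R>(
  items: T[],
  batchSize: number,
  worker: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = [];
  for (const batch of toBatches(items, batchSize)) {
    // Items within a batch run concurrently; batches run one after another.
    const batchResults = await Promise.all(batch.map(worker));
    results.push(...batchResults);
  }
  return results;
}
```

With this shape, embedding would use a batch size of 5 and vector upserts would group the results with `toBatches(vectors, 50)`.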

Document Processing Benchmarks

Document Type	Size	Processing Time
PDF (10 pages)	~500KB	~3s
PDF (100 pages)	~5MB	~15s
Large legal filing	~50MB	~60s
DOCX with images	~10MB	~8s

Run Entirely Locally

VEIL is designed to run completely air-gapped for maximum security. All services run in Docker:

CyborgDB: Encrypted vector storage (local)
Ollama: nomic-embed-text embeddings (local)
PostgreSQL: Metadata storage (local)
Python service: Document parsing (local)
Backend: Express API (local)
Frontend: Next.js UI (local)

Why Local Matters for Journalists

Concern	Cloud Solution	VEIL Local
Subpoena Risk	High - Provider logs everything	None - No external records
Network Sniffing	Encrypted but metadata exposed	No network traffic
Third-Party Access	Provider ToS allows access	You control the keys
Data Residency	Unknown jurisdiction	Your machine only

Quick Start

Prerequisites

Docker & Docker Compose
Node.js 20+ (for development)

1. Clone & Configure

Clone the repository and create your environment file with the required configuration:

ENCRYPTION_KEY: A base64-encoded 32-byte key for CyborgDB encryption
DATABASE_URL: PostgreSQL connection string

2. Generate Encryption Key

Generate a secure 32-byte encryption key using the Node.js crypto module and encode it as base64.
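One way to do this is sketched below using `randomBytes` from the Node.js `crypto` module. The helper names `generateEncryptionKey` and `isValidEncryptionKey` are illustrative and are not part of the repository.

```typescript
import { randomBytes } from "crypto";

// Generate 32 cryptographically secure random bytes and encode them as base64.
function generateEncryptionKey(): string {
  return randomBytes(32).toString("base64");
}

// Check that a configured value decodes to exactly 32 bytes.
function isValidEncryptionKey(value: string): boolean {
  return Buffer.from(value, "base64").length === 32;
}

// Print a line that can be pasted into the environment file.
console.log(`ENCRYPTION_KEY=${generateEncryptionKey()}`);
```

A 32-byte key encodes to 44 base64 characters. Store it only in the environment file and keep that file out of version control.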

3. Launch

Start all services with Docker Compose and access the application at http://localhost:3000.

4. Development Mode

For local development, run the backend, frontend, and Python service separately with their respective dev commands.

Tech Stack

Layer	Technology
Frontend	Next.js 15, React 19, TailwindCSS, Radix UI
Backend	Express 5, TypeScript, Prisma ORM
Vector DB	CyborgDB (encrypted)
Embeddings	Ollama + nomic-embed-text (768-dim)
LLM	Ollama - Llama 4 Maverick
Document Parsing	FastAPI + MarkItDown + PyMuPDF
Database	PostgreSQL 16
Auth	JWT + Google OAuth

Team

Built by:

Arpan Taneja
Ashish K. Chowdhary
Pratham Gupta
Himanshu Gupta

License

MIT License - See LICENSE for details.

Try VEIL Live

Protecting sources. Exposing truth.
