diff --git a/docs/README.md b/docs/README.md
new file mode 100644
index 0000000..e3af20a
--- /dev/null
+++ b/docs/README.md
@@ -0,0 +1,12 @@
+# PDF Highlighter Documentation
+
+Comprehensive documentation for the PDF Highlighter web application.
+
+## Table of Contents
+
+- [Architecture](./architecture.md) — System design, data flow, component hierarchy, and storage abstraction
+- [Setup Guide](./setup.md) — Prerequisites, installation, environment configuration, and running the app
+- [API Reference](./api-reference.md) — All API routes with request/response schemas and examples
+- [Components](./components.md) — React component documentation with props, behavior, and responsibilities
+- [Utilities](./utilities.md) — Utility modules, classes, type definitions, and helper functions
+- [Features](./features.md) — Feature walkthroughs for PDF upload, search, highlighting, OCR, and import/export
diff --git a/docs/api-reference.md b/docs/api-reference.md
new file mode 100644
index 0000000..bb63aa8
--- /dev/null
+++ b/docs/api-reference.md
@@ -0,0 +1,213 @@
+# API Reference
+
+All API routes are Next.js App Router route handlers located under `app/api/`.
+
+## `POST /api/highlight/get`
+
+Retrieve all highlights for a given PDF.
+
+**Source:** `app/api/highlight/get/route.ts`
+
+### Request
+
+```json
+{
+  "pdfId": "my_document__pdf"
+}
+```
+
+The body is the `pdfId` string (sent directly as JSON, or as an object with a `pdfId` field depending on the storage method). The route handler reads `body.pdfId` for SQLite or passes the body directly to Supabase.
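For illustration, a client could wrap the object-form request described above in a small helper. This is a minimal sketch rather than the application's own code: `buildGetHighlightsRequest`, `getHighlights`, `FetchLike`, `JsonRequestInit`, `ResponseLike`, and `HighlightRecord` are hypothetical names, and only the route path and the `pdfId` body field come from this reference.

```typescript
// Hypothetical types for illustration; only the route path and the `pdfId`
// body field are taken from the reference above.
interface HighlightRecord {
  id: string;
  pdfId: string;
  pageNumber: number;
  keyword?: string;
}

interface JsonRequestInit {
  method: string;
  headers: Record<string, string>;
  body: string;
}

interface ResponseLike {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}

// Minimal fetch-compatible signature, so the sketch needs no DOM or Node typings.
type FetchLike = (url: string, init: JsonRequestInit) => Promise<ResponseLike>;

// Builds the POST request using the object form of the body ({ "pdfId": ... }).
function buildGetHighlightsRequest(pdfId: string): { url: string; init: JsonRequestInit } {
  return {
    url: "/api/highlight/get",
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ pdfId }),
    },
  };
}

// Sends the request and treats any non-2xx status as an error.
async function getHighlights(pdfId: string, fetchImpl: FetchLike): Promise<HighlightRecord[]> {
  const { url, init } = buildGetHighlightsRequest(pdfId);
  const res = await fetchImpl(url, init);
  if (!res.ok) {
    throw new Error(`POST ${url} failed with status ${res.status}`);
  }
  return (await res.json()) as HighlightRecord[];
}
```

In a browser, `getHighlights(pdfId, fetch)` would pass the global `fetch`. Injecting it as a parameter is a design choice of this sketch that keeps the helper testable without a running server.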
+
+### Response
+
+**200 OK**
+
+```json
+[
+  {
+    "id": "abc123",
+    "pdfId": "my_document__pdf",
+    "pageNumber": 1,
+    "x1": 72.5,
+    "y1": 100.2,
+    "x2": 200.3,
+    "y2": 115.8,
+    "width": 612,
+    "height": 792,
+    "text": "Found \"keyword\"",
+    "image": null,
+    "keyword": "keyword"
+  }
+]
+```
+
+**500 Internal Server Error**
+
+```json
+{
+  "error": "Internal Server Error",
+  "details": "error message"
+}
+```
+
+### Behavior
+
+- SQLite: Instantiates `HighlightStorage`, calls `getHighlightsForPdf(body.pdfId)`, then closes the database connection in a `finally` block.
+- Supabase: Calls `supabaseGetHighlightsForPdf(body.pdfId)`.
+
+---
+
+## `POST /api/highlight/update`
+
+Save one or more highlights.
+
+**Source:** `app/api/highlight/update/route.ts`
+
+### Request (SQLite — single highlight)
+
+```json
+{
+  "highlights": {
+    "id": "abc123",
+    "pdfId": "my_document__pdf",
+    "pageNumber": 1,
+    "x1": 72.5,
+    "y1": 100.2,
+    "x2": 200.3,
+    "y2": 115.8,
+    "width": 612,
+    "height": 792,
+    "text": "Found \"keyword\"",
+    "keyword": "keyword"
+  }
+}
+```
+
+### Request (SQLite — bulk highlights)
+
+```json
+{
+  "pdfId": "my_document__pdf",
+  "highlights": [
+    {
+      "id": "abc123",
+      "pdfId": "my_document__pdf",
+      "pageNumber": 1,
+      "x1": 72.5,
+      "y1": 100.2,
+      "x2": 200.3,
+      "y2": 115.8,
+      "width": 612,
+      "height": 792,
+      "text": "Found \"keyword\"",
+      "keyword": "keyword"
+    }
+  ]
+}
+```
+
+### Request (Supabase — single or bulk)
+
+The body is the highlight object or array directly (no wrapping `highlights` key):
+
+```json
+[
+  {
+    "id": "abc123",
+    "pdfId": "my_document__pdf",
+    "pageNumber": 1,
+    "x1": 72.5,
+    "y1": 100.2,
+    "x2": 200.3,
+    "y2": 115.8,
+    "width": 612,
+    "height": 792,
+    "text": "Found \"keyword\"",
+    "keyword": "keyword"
+  }
+]
+```
+
+### Response
+
+- **200 OK** — Empty body
+- **500 Internal Server Error** — Empty body
+
+### Behavior
+
+- Detects single vs. bulk by checking `Array.isArray(body.highlights)` (SQLite) or `Array.isArray(body)` (Supabase).
+- Ensures every highlight has a `keyword` field (defaults to `""` if missing).
+- SQLite: Uses `INSERT OR REPLACE` (upsert) with transactions for bulk operations.
+- Supabase: Uses `upsert()` for bulk and `insert()` for single.
+
+---
+
+## `DELETE /api/highlight/update`
+
+Delete a single highlight.
+
+**Source:** `app/api/highlight/update/route.ts`
+
+### Request (SQLite)
+
+```json
+{
+  "pdfId": "my_document__pdf",
+  "id": "abc123"
+}
+```
+
+### Request (Supabase)
+
+The body is the highlight ID string directly:
+
+```json
+"abc123"
+```
+
+### Response
+
+- **200 OK** — Empty body
+- **500 Internal Server Error** — Empty body
+
+### Behavior
+
+- SQLite: Deletes by composite key `(pdfId, id)`.
+- Supabase: Deletes by `id` only.
+
+---
+
+## `POST /api/index`
+
+Index OCR-extracted words for a PDF. Currently only supports SQLite.
+
+**Source:** `app/api/index/route.ts`
+
+### Request
+
+```json
+{
+  "pdfId": "my_document__pdf",
+  "words": [
+    {
+      "keyword": "hello",
+      "x1": 50,
+      "y1": 100,
+      "x2": 120,
+      "y2": 115
+    }
+  ]
+}
+```
+
+### Response
+
+- **200 OK** — Empty body
+- **500 Internal Server Error** — Empty body
+
+### Behavior
+
+- SQLite: Instantiates `HighlightStorage` and calls `indexWords(pdfId, words)`, which converts words to `StoredHighlight` objects with generated IDs and saves them in bulk.
+- Supabase: Throws `"Index via supabase has not been implemented"`.
+
+> **Note:** This route is currently not called in the application (the code in `App.tsx` that would call it is commented out).
diff --git a/docs/architecture.md b/docs/architecture.md
new file mode 100644
index 0000000..2a2e54d
--- /dev/null
+++ b/docs/architecture.md
@@ -0,0 +1,191 @@
+# Architecture
+
+## High-Level Overview
+
+PDF Highlighter is a Next.js 14 application (App Router) that allows users to upload PDFs, search for keywords with automatic text highlighting, manually select areas, and persist highlights to a database.
+
+```
+┌─────────────────────────────────────────────────────┐
+│                  Browser (Client)                   │
+│                                                     │
+│  ┌───────────┐ ┌──────────────┐  ┌───────────────┐  │
+│  │PdfUploader│ │KeywordSearch │  │HighlightUpload│  │
+│  └─────┬─────┘ └──────┬───────┘  └──────┬────────┘  │
+│        │              │                 │           │
+│        ▼              ▼                 ▼           │
+│  ┌─────────────────────────────────────────────┐    │
+│  │             App (orchestrator)              │    │
+│  │  state: pdfUrl, highlights, searchTerm,     │    │
+│  │         pdfId, loading, pdfOcrUrl           │    │
+│  └──────────────────┬──────────────────────────┘    │
+│                     │                               │
+│        ┌────────────┼────────────┐                  │
+│        ▼            ▼            ▼                  │
+│   ┌──────────┐ ┌──────────┐ ┌─────────┐             │
+│   │PdfViewer │ │ Sidebar  │ │ Spinner │             │
+│   │(react-pdf│ │(highlight│ │         │             │
+│   │-highlight│ │ list)    │ │         │             │
+│   │er)       │ │          │ │         │             │
+│   └──────────┘ └──────────┘ └─────────┘             │
+│                                                     │
+│  ┌──────────────────────────────────────────────┐   │
+│  │            pdfUtils (client-side)            │   │
+│  │  searchPdf() ─ pdfjs-dist text extraction    │   │
+│  │  convertPdfToImages() ─ canvas rendering     │   │
+│  │  Tesseract.js OCR (in App component)         │   │
+│  └──────────────────────────────────────────────┘   │
+└──────────────────────┬──────────────────────────────┘
+                       │ fetch() API calls
+                       ▼
+┌─────────────────────────────────────────────────────┐
+│                 Next.js API Routes                  │
+│                                                     │
+│  POST /api/highlight/get      ─ retrieve highlights │
+│  POST /api/highlight/update   ─ save highlights     │
+│  DELETE /api/highlight/update ─ delete highlight    │
+│  POST /api/index              ─ index OCR words     │
+└──────────────────────┬──────────────────────────────┘
+                       │
+                       ▼
+┌─────────────────────────────────────────────────────┐
+│              Storage Abstraction Layer              │
+│                                                     │
+│  STORAGE_METHOD env var selects backend:            │
+│                                                     │
+│  ┌──────────────────┐   ┌──────────────────┐        │
+│  │ HighlightStorage │   │ supabase.ts      │        │
+│  │ (SQLite wrapper) │   │ (Supabase client)│        │
+│  │                  │   │                  │        │
+│  │ ┌──────────────┐ │   │ saveHighlight()  │        │
+│  │ │SQLiteDatabase│ │   │ saveBulkH...()   │        │
+│  │ │highlights.db │ │   │ getHighlights()  │        │
+│  │ └──────────────┘ │   │ deleteH...()     │        │
+│  └──────────────────┘   └──────────────────┘        │
+└─────────────────────────────────────────────────────┘
+```
+
+## Data Flow
+
+### Upload and OCR
+
+```
+User selects PDF file
+        │
+        ▼
+App.handleFileUpload()
+        │
+        ├─► URL.createObjectURL(file) → pdfUrl
+        │
+        ├─► convertPdfToImages(file)
+        │       │
+        │       ▼
+        │   pdfjs-dist renders pages to
+        │   canvas.toDataURL() → base64 images
+        │
+        ├─► Tesseract.js worker.recognize(image)
+        │       │
+        │       ▼
+        │   OCR output → new PDF blob → pdfOcrUrl
+        │
+        ├─► getPdfId(filename, email?) → pdfId
+        │
+        └─► fetch("/api/highlight/get") → load saved highlights
+```
+
+### Keyword Search
+
+```
+User enters keywords (pipe-separated: "word1|word2")
+        │
+        ▼
+App.handleSearch()
+        │
+        ├─► searchPdf(keywords, pdfUrl, zoom)
+        │       │
+        │       ▼
+        │   pdfjs-dist extracts text per page
+        │   Groups text items into lines (by y-coordinate)
+        │   Regex match keywords in each line
+        │   Calculate bounding box coordinates
+        │   Return IHighlight[] with positions
+        │
+        ├─► If no results and pdfOcrUrl exists:
+        │       searchPdf(keywords, pdfOcrUrl, zoom)
+        │
+        ├─► Merge new highlights with existing
+        │
+        └─► POST /api/highlight/update → persist to DB
+```
+
+### Manual Area Selection
+
+```
+User holds Alt + clicks and drags on PDF
+        │
+        ▼
+PdfHighlighter.enableAreaSelection(event.altKey)
+        │
+        ▼
+onSelectionFinished(position, content)
+        │
+        ▼
+<Tip> component → user enters comment
+        │
+        ▼
+Create IHighlight with area position
+        │
+        ├─► POST /api/highlight/update → persist
+        └─► setHighlights([...prev, newHighlight])
+```
+
+## Storage Abstraction
+
+The application supports two storage backends, selected via the `STORAGE_METHOD` environment variable:
+
+| Feature | SQLite | Supabase |
+|---------|--------|----------|
+| Setup | Zero-config, local file | Requires Supabase project |
+| Location | `process.cwd()/highlights.db` | Cloud-hosted |
+| Class/Module | `HighlightStorage` wrapping `SQLiteDatabase` | Individual exported functions |
+| Word indexing | Supported | Not implemented |
+| Export/Import | Client-side JSON | `exportToJson()` / `importFromJson()` server-side |
+
+API routes check `storageMethod` and delegate to the appropriate backend. The SQLite path instantiates `HighlightStorage` (which creates an `SQLiteDatabase`), while the Supabase path calls standalone functions from `supabase.ts`.
+
+## State Management
+
+The application uses React hooks exclusively (no external state library). All top-level state lives in the `App` component:
+
+| State | Type | Purpose |
+|-------|------|---------|
+| `pdfUploaded` | `boolean` | Whether a PDF has been uploaded |
+| `pdfUrl` | `string \| null` | Object URL of the uploaded PDF |
+| `pdfOcrUrl` | `string \| null` | Object URL of the OCR-processed PDF |
+| `pdfName` | `string \| null` | Original filename |
+| `pdfId` | `string \| null` | Derived identifier for DB storage |
+| `searchTerm` | `string` | Current keyword search input |
+| `highlights` | `IHighlight[]` | All current highlights |
+| `highlightsKey` | `number` | Incremented to force `PdfHighlighter` re-render |
+| `loading` | `boolean` | OCR processing indicator |
+
+State flows down via props. Child components call parent callbacks (e.g., `onFileUpload`, `handleSearch`, `setHighlights`) to update state.
+
+## Component Hierarchy
+
+```
+App
+├── Header
+├── PdfUploader
+├── HighlightUploader (shown when pdfId exists)
+├── KeywordSearch (shown when pdfUrl exists)
+├── Spinner (shown during loading)
+└── PdfViewer
+    ├── Sidebar
+    │   └── Button (delete per highlight)
+    └── PdfHighlighter (react-pdf-highlighter)
+        ├── PdfLoader
+        ├── Highlight / AreaHighlight
+        ├── Popup
+        │   └── HighlightPopup
+        └── Tip (on selection)
+```
diff --git a/docs/components.md b/docs/components.md
new file mode 100644
index 0000000..76c3f88
--- /dev/null
+++ b/docs/components.md
@@ -0,0 +1,281 @@
+# Components
+
+All components are located in `app/components/`.
+
+## App
+
+**File:** `app/components/App.tsx`
+
+The root orchestrator component. Manages all top-level application state and coordinates the upload, OCR, search, and highlight persistence workflows.
+
+### State
+
+| State | Type | Description |
+|-------|------|-------------|
+| `pdfUploaded` | `boolean` | Whether a PDF file has been uploaded |
+| `pdfUrl` | `string \| null` | Object URL of the original uploaded PDF |
+| `pdfOcrUrl` | `string \| null` | Object URL of the OCR-processed PDF |
+| `pdfName` | `string \| null` | Original filename of the uploaded PDF |
+| `pdfId` | `string \| null` | Derived identifier used for database storage |
+| `searchTerm` | `string` | Current keyword search input |
+| `highlights` | `IHighlight[]` | All active highlights |
+| `highlightsKey` | `number` | Incremented on highlight changes to force PdfHighlighter re-render |
+| `loading` | `boolean` | `true` during OCR processing |
+
+### Key Behaviors
+
+- On file upload: creates an object URL, runs OCR via Tesseract.js, generates a `pdfId`, and loads any saved highlights from the API.
+- On search: splits the search term by `|` to support multiple keywords, calls `searchPdf()`, falls back to the OCR PDF if no results, merges results with existing highlights, and persists to the database.
+- On highlight upload (JSON): reads the file, converts `StoredHighlight[]` to `IHighlight[]`, updates state, and persists to the database.
+- Listens for `hashchange` events to scroll to highlights referenced by URL hash (`#highlight-`).
+
+---
+
+## PdfViewer
+
+**File:** `app/components/PdfViewer.tsx`
+
+Renders the PDF document and handles highlight display and interaction using the `react-pdf-highlighter` library.
+
+### Props
+
+```typescript
+interface PdfViewerProps {
+  pdfUrl: string | null;
+  pdfName: string | null;
+  pdfId: string | null;
+  highlights: Array<IHighlight>;
+  setHighlights: React.Dispatch<React.SetStateAction<Array<IHighlight>>>;
+  highlightsKey: number;
+  pdfViewerRef: React.RefObject;
+  resetHash: () => void;
+  scrollViewerTo: React.MutableRefObject<(highlight: IHighlight) => void>;
+  scrollToHighlightFromHash: () => void;
+}
+```
+
+### Key Behaviors
+
+- When `pdfUrl` is `null`, displays a prompt to upload a PDF.
+- Area selection is enabled by holding the `Alt` key (`enableAreaSelection: event.altKey`).
+- On text/area selection finish, shows a `<Tip>` component for adding a comment, then persists the highlight via the API.
+- Renders text highlights with `<Highlight>` and area highlights with `<AreaHighlight>`.
+- Hover popups display the highlight comment via `<HighlightPopup>`.
+- Contains a collapsible `<Sidebar>` for managing highlights.
+
+---
+
+## PdfUploader
+
+**File:** `app/components/PdfUploader.tsx`
+
+File input component for uploading PDF files.
+
+### Props
+
+```typescript
+interface PdfUploaderProps {
+  onFileUpload: (file: File) => void;
+  pdfUploaded: boolean;
+}
+```
+
+### Key Behaviors
+
+- Renders a hidden `` with a styled `