Skip to content

Implements vector search through images and text in PDFs

Notifications You must be signed in to change notification settings

DanaY326/PDF_vector_search

 
 

Repository files navigation

PDF Vector Search application

Demo video

Project Overview

This project is a PDF viewer and vector search application that can search through both images and text using HuggingFace Transformers

Features

  • PDF document upload and display
  • Page navigation (next, previous, jump to specific page)
  • Zoom in/out functionality
  • Document information display (total pages, current page)
  • Indexing of text, including with OCR, and all images
  • Vector search across multiple PDFs
  • Text highlighting for search matches
  • Sidebar for search results and navigation
  • Responsive design for various screen sizes
  • Persistent storage of highlights using SQLite

Technologies Used

  • Next.js
  • React
  • TypeScript
  • react-pdf library for PDF rendering
  • Tailwind CSS for styling
  • SQLite for local highlight storage
  • HuggingFace for embedding data in preparation for vector search

Getting Started

  1. Clone the repository
  2. Install dependencies: pnpm install
  3. Run the development server: pnpm run dev
  4. Open http://localhost:3000 in your browser

Project Structure

  • app/page.js: Main entry point of the application
  • app/components/: React components for various parts of the application
  • app/utils/: Utility functions for PDF processing and highlight storage
  • app/styles/: CSS files for styling
  • app/api/: API routes for handling highlight operations

Key Components

  • App.tsx: Core application component
  • PdfViewer.tsx: Handles PDF rendering and navigation
  • KeywordSearch.tsx: Manages keyword search functionality
  • HighlightPopup.tsx: Displays information about highlighted text
  • Sidebar.tsx: Shows search results and navigation options
  • highlightStorage.ts: Manages highlight storage operations
  • clipText.ts and clipImage.ts: Handles embedding data

Features

  • Has a highlight storage system supporting both SQLite
  • Semantic search across images and text in multiple PDFs
  • API routes for creating, retrieving, updating, and deleting highlights
  • User authentication and document permissions (currently disabled)
  • Export/import as JSON functionality for highlights
  • Scroll the sidebar highlighted area into view across different PDFs.

Future Improvements

  • Implement annotation tools (e.g., freehand drawing, text notes)
  • Add support for multiple document search
  • Pre-process batch PDFs for quicker highlights
  • Enhance mobile responsiveness for better small-screen experience
  • Optimize performance for large PDF files
  • Upload the PDF into the database.

License

MIT License

Acknowledgements

About

Implements vector search through images and text in PDFs

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • TypeScript 82.0%
  • CSS 10.7%
  • Shell 5.7%
  • JavaScript 1.6%