Vukotije/pdf-search-engine

📄 PDF Search Engine

A Python-based search engine for PDF documents, built as a university project for the Data Structures and Algorithms course.
It demonstrates how data structures (Trie, Stack) and algorithms (PageRank, Boolean expression parsing) apply to a working search application.


📝 Overview

This system allows users to:

  • Search for words and expressions inside PDF files
  • Use Boolean operators (AND, OR, NOT)
  • Get autocomplete suggestions by ending a prefix with the * wildcard (for example, algo*)
  • Rank results using a PageRank algorithm
  • Receive "Did You Mean?" spelling suggestions
  • Export search results to a formatted PDF

✨ Features

| Feature | Description |
| --- | --- |
| 🔍 Fast Search | Trie data structure enables O(m) lookup (m = word length) |
| 📊 Smart Ranking | PageRank determines relevance of PDF pages |
| 🧠 Boolean Search | Supports AND, OR, NOT, and parentheses |
| 🔤 Autocomplete | Use algo* to list possible word completions |
| 🤖 Spell Suggestions | Automatic "Did you mean?" correction |
| 💾 Caching | Stores processed results for faster startup |
| 📄 PDF Export | Save the search results into a generated PDF |

📁 Project Structure

pdf-search-engine/
├── data/                 # PDF input files
├── serialized_data/      # Cached data (auto-generated)
└── src/
    ├── main.py           # Application entry point
    ├── config.py         # Settings and constants
    ├── trie.py           # Trie implementation
    ├── pagerank.py       # PageRank algorithm
    ├── search_engine.py  # Search orchestration logic
    ├── stack_and_postfix.py # Boolean expression parser (Shunting Yard)
    └── utils.py          # Helper functions (PDF parsing, caching, etc.)
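
The serialized_data/ folder holds cached results so that the PDFs do not have to be re-processed on every start. A minimal sketch of that load-or-build pattern is shown here, assuming the cache is a pickle file; the function name load_or_build and the file name index.pkl are hypothetical and not taken from the repository.

```python
import pickle
from pathlib import Path


def load_or_build(cache_path, build):
    """Return cached data if the file exists, otherwise build and cache it.

    `build` is any zero-argument callable that produces the data
    (for example, a function that parses the PDFs and fills the Trie).
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Cache hit: skip the expensive build step entirely.
        with cache_path.open("rb") as f:
            return pickle.load(f)

    data = build()
    # Create serialized_data/ (or any parent folder) on first run.
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as f:
        pickle.dump(data, f)
    return data
```

A call such as load_or_build("serialized_data/index.pkl", build_index) would run build_index only on the first start. Pickle files should only be loaded when they come from a trusted source, since unpickling can execute arbitrary code.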

🚀 Installation & Running

Prerequisites

  • Python 3.8+
  • pip package manager

Setup

git clone <repo-url>
cd pdf-search-engine

python -m venv venv
# Windows:
venv\Scripts\activate
# macOS/Linux:
source venv/bin/activate

pip install -r requirements.txt

Run the Program

python src/main.py

🔎 Usage Examples

| Query Type | Example | Meaning |
| --- | --- | --- |
| Simple Search | algorithm | Find all pages containing the word |
| Autocomplete | algo* | List word completions starting with algo |
| Boolean AND | sorting AND algorithm | Pages containing both words |
| Boolean OR | tree OR graph | Pages containing at least one word |
| Boolean NOT | data NOT structure | Pages with data but not structure |
| Grouping | (tree OR graph) AND algorithm | Parenthesized group is evaluated first, then combined with algorithm |
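
The Boolean queries above can be evaluated in two steps: convert the infix query to postfix with the Shunting Yard algorithm, then evaluate the postfix form with a stack of page sets. The sketch below is an illustration under stated assumptions, not the code in stack_and_postfix.py: it treats NOT as a binary "but not" operator (matching the data NOT structure example), assumes the precedence NOT > AND > OR, and takes a hypothetical index that maps each word to the set of pages containing it.

```python
# Assumed precedence: NOT binds tightest, then AND, then OR.
PRECEDENCE = {"NOT": 3, "AND": 2, "OR": 1}


def tokenize(query):
    """Split a query into words, operators, and parentheses."""
    return query.replace("(", " ( ").replace(")", " ) ").split()


def to_postfix(tokens):
    """Shunting Yard: convert infix tokens to postfix order."""
    output, ops = [], []
    for tok in tokens:
        if tok in PRECEDENCE:
            # All operators are left-associative here, so pop while the
            # operator on top of the stack has equal or higher precedence.
            while ops and ops[-1] != "(" and PRECEDENCE[ops[-1]] >= PRECEDENCE[tok]:
                output.append(ops.pop())
            ops.append(tok)
        elif tok == "(":
            ops.append(tok)
        elif tok == ")":
            while ops and ops[-1] != "(":
                output.append(ops.pop())
            if not ops:
                raise ValueError("unbalanced parentheses")
            ops.pop()  # discard the matching "("
        else:
            output.append(tok.lower())  # search terms are case-insensitive
    while ops:
        op = ops.pop()
        if op == "(":
            raise ValueError("unbalanced parentheses")
        output.append(op)
    return output


def evaluate(postfix, index):
    """Evaluate a postfix query against a word -> set-of-pages mapping."""
    stack = []
    for tok in postfix:
        if tok in PRECEDENCE:
            right = stack.pop()
            left = stack.pop()
            if tok == "AND":
                stack.append(left & right)
            elif tok == "OR":
                stack.append(left | right)
            else:  # NOT: pages in left that are not in right
                stack.append(left - right)
        else:
            stack.append(index.get(tok, set()))
    return stack.pop()
```

For example, (tree OR graph) AND algorithm becomes the postfix sequence tree graph OR algorithm AND, so the union is computed before the intersection.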

⚙️ How It Works

  1. PDF Parsing — Extracts text using PyMuPDF
  2. Trie Construction — Stores words for fast prefix-based searching
  3. PageRank Processing — Assigns relevance based on cross-page references
  4. Boolean Logic Evaluation — Uses the Shunting Yard algorithm to parse expressions
  5. Output Formatting — Displays ranked results and allows PDF export
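
Step 3 can be illustrated with the standard power-iteration form of PageRank. This is a sketch only: the README does not say how cross-page references are detected, so the links mapping (page number to the list of pages it references), the damping factor of 0.85, and the fixed iteration count are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank scores for a page graph.

    `links` maps each page to the list of pages it references.
    Returns a dict of page -> score; the scores sum to 1.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    if not pages:
        return {}

    n = len(pages)
    ranks = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Pages with no outgoing references spread their rank evenly.
        dangling = sum(ranks[p] for p in pages if not links.get(p))
        new_ranks = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for src, targets in links.items():
            if targets:
                share = damping * ranks[src] / len(targets)
                for target in targets:
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks
```

A page referenced by many other pages ends up with a higher score, which can then be used to order the pages that match a query.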

🧠 Data Structures & Algorithms

| Component | Purpose |
| --- | --- |
| Trie | Efficient word lookup and autocomplete |
| Stack | Parsing and evaluating Boolean expressions |
| Graph + PageRank | Ranking relevance of pages |
| Shunting Yard Algorithm | Converting infix to postfix expressions |
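
To make the Trie row concrete, here is a minimal sketch of a Trie that supports exact lookup in O(m) steps for a word of length m, plus prefix completion for queries such as algo*. The class and method names, and the choice to store a set of page numbers on each word-ending node, are assumptions for illustration and may differ from trie.py.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.is_word = False  # True if a stored word ends at this node
        self.pages = set()    # pages on which that word occurs


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, page):
        """Store `word` and record that it occurs on `page`."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
        node.pages.add(page)

    def _find(self, prefix):
        """Walk down the trie; return the node for `prefix` or None."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def search(self, word):
        """Return the set of pages containing exactly `word`."""
        node = self._find(word)
        return set(node.pages) if node is not None and node.is_word else set()

    def complete(self, prefix):
        """Return all stored words that start with `prefix`, sorted."""
        start = self._find(prefix)
        if start is None:
            return []
        words, stack = [], [(start, prefix)]
        while stack:
            node, text = stack.pop()
            if node.is_word:
                words.append(text)
            for ch, child in node.children.items():
                stack.append((child, text + ch))
        return sorted(words)
```

Lookup cost depends only on the length of the query word, not on how many words are stored, which is why the same structure serves both search and autocomplete.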

🚧 Future Improvements

  • Support multiple PDF files at once
  • Add full sentence / phrase search
  • Use Levenshtein Distance for improved autocorrect
  • Build a simple web UI frontend
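
As a rough idea of what the Levenshtein-based autocorrect could look like, the sketch below computes edit distance with the usual dynamic-programming recurrence and picks the closest vocabulary word. It is not part of the current project: the did_you_mean helper, the max_distance threshold of 2, and the plain list used as vocabulary are all hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string `a` into string `b`."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        cur = [i]                   # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # substitute or match
        prev = cur
    return prev[-1]


def did_you_mean(word, vocabulary, max_distance=2):
    """Return the closest known word, or None if nothing is close enough."""
    # Ties are broken alphabetically so the result is deterministic.
    best = min(vocabulary, key=lambda w: (levenshtein(word, w), w), default=None)
    if best is None or levenshtein(word, best) > max_distance:
        return None
    return best
```

This compares the query against every vocabulary word, which is fine for a small index; a larger vocabulary would need pruning, for example by walking the Trie with a distance bound.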

🎓 Course Information

  • Course: Data Structures and Algorithms
  • Level: 1st Year University
  • Focus: Practical application of Trie, Stack, and PageRank
