📄 PDF Search Engine

A Python-based search engine for PDF documents, built as a university project for the Data Structures and Algorithms course.
The project demonstrates practical use of data structures (Trie, Stack) and algorithms (PageRank, Boolean expression parsing) in real-world search applications.

📝 Overview

This system allows users to:

Search for words and expressions inside PDF files
Use Boolean operators (AND, OR, NOT)
Get autocomplete suggestions using a trailing `*` wildcard (for example, `algo*`)
Rank results using a PageRank algorithm
Receive "Did You Mean?" spelling suggestions
Export search results to a formatted PDF

✨ Features

Feature	Description
🔍 Fast Search	Trie data structure enables O(m) lookup (m = word length)
📊 Smart Ranking	PageRank determines relevance of PDF pages
🧠 Boolean Search	Supports `AND`, `OR`, `NOT`, and parentheses
🔤 Autocomplete	Use `algo*` to list possible word completions
🤖 Spell Suggestions	Automatic "Did You Mean?" suggestions for misspelled words
💾 Caching	Stores processed results for faster startup
📄 PDF Export	Save the search results into a generated PDF
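
To illustrate the O(m) lookup and the `algo*` autocomplete listed above, here is a minimal Trie sketch. It is not the code from `src/trie.py`; the class and method names (`TrieNode`, `Trie`, `contains`, `complete`) are assumptions made for this example.

```python
from typing import Dict, List, Optional


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_word = False  # True if a stored word ends at this node


class Trie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, word: str) -> None:
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def _find(self, prefix: str) -> Optional[TrieNode]:
        # Walk one node per character, so the cost is O(m) for a prefix of length m.
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word: str) -> bool:
        node = self._find(word)
        return node is not None and node.is_word

    def complete(self, prefix: str) -> List[str]:
        # Collect every stored word below the prefix node (iterative depth-first walk).
        start = self._find(prefix)
        if start is None:
            return []
        results: List[str] = []
        stack = [(start, prefix)]
        while stack:
            node, text = stack.pop()
            if node.is_word:
                results.append(text)
            for ch, child in node.children.items():
                stack.append((child, text + ch))
        return sorted(results)


trie = Trie()
for w in ["algorithm", "algorithms", "algebra", "graph"]:
    trie.insert(w)

print(trie.complete("algo"))  # ['algorithm', 'algorithms']
```

Exact lookup touches one node per character, while autocomplete additionally visits the subtree below the prefix, so its cost grows with the number of completions.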

📁 Project Structure

pdf-search-engine/
├── data/                 # PDF input files
├── serialized_data/      # Cached data (auto-generated)
└── src/
    ├── main.py           # Application entry point
    ├── config.py         # Settings and constants
    ├── trie.py           # Trie implementation
    ├── pagerank.py       # PageRank algorithm
    ├── search_engine.py  # Search orchestration logic
    ├── stack_and_postfix.py # Boolean expression parser (Shunting Yard)
    └── utils.py          # Helper functions (PDF parsing, caching, etc.)

🚀 Installation & Running

Prerequisites

Python 3.8+
pip package manager

Setup

git clone <repo-url>
cd pdf-search-engine

python -m venv venv
# Windows:
venv\Scripts\activate
# macOS/Linux:
source venv/bin/activate

pip install -r requirements.txt

Run the Program

python src/main.py

🔎 Usage Examples

Query Type	Example	Meaning
Simple Search	`algorithm`	Find all pages containing the word
Autocomplete	`algo*`	List word completions starting with `algo`
Boolean AND	`sorting AND algorithm`	Pages containing both words
Boolean OR	`tree OR graph`	Pages containing at least one word
Boolean NOT	`data NOT structure`	Pages containing `data` but not `structure`
Grouping	`(tree OR graph) AND algorithm`	Pages containing `algorithm` and at least one of `tree` or `graph`; parentheses control evaluation order
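
The queries in the table can be evaluated by converting them to postfix with the Shunting Yard algorithm and then combining sets of page numbers with a stack. The sketch below is an illustration, not the code from `src/stack_and_postfix.py`. Assumptions: the function names and the `index` mapping are hypothetical, `NOT` is treated as a binary "but not" operator (as in `data NOT structure`), and the precedence `NOT` > `AND` > `OR` is a guess rather than something the project documents.

```python
from typing import Dict, List, Set

# Assumed precedence (not confirmed by the project): NOT > AND > OR.
PRECEDENCE = {"NOT": 3, "AND": 2, "OR": 1}


def tokenize(query: str) -> List[str]:
    # Pad parentheses with spaces so they become separate tokens.
    return query.replace("(", " ( ").replace(")", " ) ").split()


def to_postfix(tokens: List[str]) -> List[str]:
    output: List[str] = []
    operators: List[str] = []
    for tok in tokens:
        if tok in PRECEDENCE:
            # Pop operators of higher or equal precedence (left-associative).
            while (operators and operators[-1] != "("
                   and PRECEDENCE[operators[-1]] >= PRECEDENCE[tok]):
                output.append(operators.pop())
            operators.append(tok)
        elif tok == "(":
            operators.append(tok)
        elif tok == ")":
            while operators and operators[-1] != "(":
                output.append(operators.pop())
            if not operators:
                raise ValueError("unbalanced parentheses")
            operators.pop()  # discard the matching "("
        else:
            output.append(tok.lower())  # search words are compared in lowercase
    while operators:
        op = operators.pop()
        if op == "(":
            raise ValueError("unbalanced parentheses")
        output.append(op)
    return output


def evaluate(postfix: List[str], index: Dict[str, Set[int]]) -> Set[int]:
    stack: List[Set[int]] = []
    for tok in postfix:
        if tok in PRECEDENCE:
            right = stack.pop()
            left = stack.pop()
            if tok == "AND":
                stack.append(left & right)
            elif tok == "OR":
                stack.append(left | right)
            else:  # NOT as set difference: pages in left but not in right
                stack.append(left - right)
        else:
            stack.append(index.get(tok, set()))
    return stack.pop()


# Hypothetical index: word -> set of page numbers containing it.
index = {"tree": {1, 2}, "graph": {2, 3}, "algorithm": {2, 3, 4},
         "data": {1, 4}, "structure": {4}}

postfix = to_postfix(tokenize("(tree OR graph) AND algorithm"))
print(postfix)                          # ['tree', 'graph', 'OR', 'algorithm', 'AND']
print(sorted(evaluate(postfix, index)))  # [2, 3]
```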

⚙️ How It Works

PDF Parsing — Extracts text using PyMuPDF
Trie Construction — Stores words for fast prefix-based searching
PageRank Processing — Assigns relevance based on cross-page references
Boolean Logic Evaluation — Uses the Shunting Yard algorithm to parse expressions
Output Formatting — Displays ranked results and allows PDF export
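
The PageRank step can be sketched as a power iteration over a graph of page references. This is a generic illustration, not the code from `src/pagerank.py`: the `links` mapping, the damping factor of 0.85, and the iteration count are assumptions, and every referenced page is assumed to appear as a key in `links`.

```python
from typing import Dict, List


def pagerank(links: Dict[int, List[int]], damping: float = 0.85,
             iterations: int = 100) -> Dict[int, float]:
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Rank held by pages without outgoing references is shared by all pages.
        dangling = sum(rank[p] for p in pages if not links[p])
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            for target in links[p]:
                # Each page splits its rank evenly among the pages it references.
                new_rank[target] += damping * rank[p] / len(links[p])
        rank = new_rank
    return rank


# Hypothetical reference graph: page -> pages it refers to.
links = {1: [2, 3], 2: [3], 3: [1], 4: [3]}
rank = pagerank(links)
print(sorted(rank, key=rank.get, reverse=True))  # [3, 1, 2, 4]
```

Page 3 ranks highest because three other pages refer to it, while page 4 ranks lowest because nothing refers to it.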

🧠 Data Structures & Algorithms

Component	Purpose
Trie	Efficient word lookup and autocomplete
Stack	Parsing and evaluating Boolean expressions
Graph + PageRank	Ranking relevance of pages
Shunting Yard Algorithm	Converting infix to postfix expressions

🚧 Future Improvements

Support multiple PDF files at once
Add full sentence / phrase search
Use Levenshtein Distance for improved autocorrect
Build a simple web UI frontend
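
As a starting point for the Levenshtein idea above, here is a minimal sketch of edit-distance-based suggestions. It does not describe how the current "Did You Mean?" feature works; the function names and the `max_distance` threshold are assumptions made for this example.

```python
from typing import Iterable, Optional


def levenshtein(a: str, b: str) -> int:
    # Dynamic programming over two rows: previous[j] is the distance
    # between the first i-1 characters of a and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def did_you_mean(word: str, vocabulary: Iterable[str],
                 max_distance: int = 2) -> Optional[str]:
    # Pick the closest known word, but only suggest it if it is close enough.
    best = min(vocabulary, key=lambda candidate: levenshtein(word, candidate),
               default=None)
    if best is None or levenshtein(word, best) > max_distance:
        return None
    return best


print(levenshtein("kitten", "sitting"))                          # 3
print(did_you_mean("algoritm", ["algorithm", "graph", "tree"]))  # algorithm
```

Scanning the whole vocabulary is linear in the number of indexed words, which is acceptable for a small collection; a larger index would need a smarter candidate filter.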

🎓 Course Information

Course: Data Structures and Algorithms
Level: 1st Year University
Focus: Practical application of Trie, Stack, and PageRank
