PDF2LLM

Extract highlighted text from PDF files and convert it to clean Markdown, ready for use with LLM tools such as NotebookLM.

Features

🎯 Single File Processing - Extract highlights from a single PDF
📁 Batch Processing - Process all PDFs in a folder
🎨 Universal Compatibility - Supports annotations from mainstream PDF readers:
- Adobe Acrobat Reader
- Foxit PDF Editor/Reader
- PDF-XChange Editor
- Nitro PDF
- macOS Preview
- And other PDF readers following the PDF annotation standard
📝 Clean Markdown Output - LLM-friendly format with page grouping
🖱️ Easy to Use - Drag-and-drop batch files for Windows
✨ LLM Polish - Fix OCR errors using AI (Google Gemini, OpenAI, Anthropic)

Installation

1. Make sure Python 3.10+ is installed.
2. Install the dependencies:

pip install -r requirements.txt

Usage

Command Line

# Single file
python -m pdf2llm input.pdf

# Single file with custom output directory
python -m pdf2llm input.pdf -o ./output/

# Batch process a folder
python -m pdf2llm ./pdfs/ --batch

# Batch process including subfolders
python -m pdf2llm ./pdfs/ --batch --recursive
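
For scripted runs, the same commands can be assembled from Python. The sketch below is not part of PDF2LLM: `build_command` is a hypothetical helper, and it uses only the flags shown above (`-o`, `--batch`, `--recursive`). It assumes `--recursive` is only meaningful together with `--batch`, as in the examples.

```python
import sys


def build_command(target, output_dir=None, batch=False, recursive=False):
    """Build the argument list for one pdf2llm run (hypothetical helper)."""
    # Use the current interpreter so the right virtual environment is picked up.
    cmd = [sys.executable, "-m", "pdf2llm", str(target)]
    if output_dir is not None:
        cmd += ["-o", str(output_dir)]
    if batch:
        cmd.append("--batch")
        # Assumption: --recursive is only passed alongside --batch.
        if recursive:
            cmd.append("--recursive")
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`, which raises an error if the command exits with a non-zero status.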

LLM Polish (Fix OCR Errors)

Use the --polish flag to clean up OCR errors using an LLM:

# Polish with default model (Google Gemini 3 Flash - free tier)
python -m pdf2llm input.pdf --polish

# Polish with a specific model
python -m pdf2llm input.pdf --polish --model gpt-4o-mini
python -m pdf2llm input.pdf --polish --model claude-3-haiku-20240307

When using --polish, two files are created for comparison:

- input-ocr.md - Raw extracted text (before LLM polish)
- input-llm.md - Polished text (after LLM polish)

This allows you to verify that no content was lost during polishing.
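
One way to run that check is sketched below with Python's standard `difflib` module. It is not part of PDF2LLM: `lines_without_match` is a hypothetical helper, and the `0.8` similarity cutoff is an arbitrary starting value. It lists raw lines that have no close counterpart in the polished text, which may point to dropped content.

```python
import difflib


def lines_without_match(raw_text, polished_text, cutoff=0.8):
    """Return raw lines that have no close counterpart in the polished text."""
    polished_lines = [ln.strip() for ln in polished_text.splitlines() if ln.strip()]
    missing = []
    for line in raw_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # A small OCR fix still scores above the cutoff; a dropped line does not.
        if not difflib.get_close_matches(line, polished_lines, n=1, cutoff=cutoff):
            missing.append(line)
    return missing
```

To use it, read `input-ocr.md` and `input-llm.md` as text and pass both strings in. An empty result suggests every raw line survived in some form, though a manual skim is still worthwhile.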

Setup API Key

1. Copy .env.example to .env
2. Add your API key:

# Google Gemini (free tier available)
# Get your key at: https://aistudio.google.com/apikey
GOOGLE_API_KEY=your_key_here
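
Before running with `--polish`, it can help to confirm that the key is actually visible. The sketch below is a stand-alone check, not PDF2LLM's own loader: `parse_env` and `has_api_key` are hypothetical helpers that handle only simple `KEY=value` lines and `#` comments, and they treat the `your_key_here` placeholder as "not set".

```python
import os
from pathlib import Path


def parse_env(text):
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def has_api_key(name="GOOGLE_API_KEY", path=".env"):
    """True if the key is set in the environment or in the .env file."""
    env_path = Path(path)
    file_values = parse_env(env_path.read_text(encoding="utf-8")) if env_path.exists() else {}
    value = os.environ.get(name) or file_values.get(name, "")
    return bool(value) and value != "your_key_here"
```

Calling `has_api_key()` from the project folder returns `False` when the key is missing or still set to the placeholder.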

Supported Models

| Provider | Models | Free Tier |
| --- | --- | --- |
| Google | gemini-3-flash-preview, gemini-2.5-flash, gemini-2.0-flash | ✅ Yes |
| OpenAI | gpt-4o, gpt-4o-mini, gpt-3.5-turbo | ❌ No |
| Anthropic | claude-3-5-sonnet, claude-3-haiku | ❌ No |

Drag and Drop (Windows)

- Single PDF: Drag a PDF file onto extract.bat
- Folder: Drag a folder onto extract_folder.bat

The Markdown files will be created in the same location as the source PDFs.

Output Example

Given a PDF with highlights, the output will look like:

# Biology_Chapter_5

## Page 3
- Photosynthesis is the process by which plants convert sunlight into energy.
- Chlorophyll is the green pigment responsible for absorbing light.

## Page 7
- The mitochondria is the powerhouse of the cell.
- ATP is the primary energy currency of cells.
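
The page-grouped layout above can be reproduced in a few lines of Python. This is a sketch of the grouping step only, not PDF2LLM's actual code: `render_markdown` and its `(page_number, text)` input format are assumptions made for illustration.

```python
from collections import defaultdict


def render_markdown(title, highlights):
    """Group (page_number, text) pairs by page and render them as Markdown."""
    by_page = defaultdict(list)
    for page, text in highlights:
        by_page[page].append(text.strip())

    lines = [f"# {title}"]
    # Pages are emitted in ascending order; highlights keep their input order.
    for page in sorted(by_page):
        lines.append("")
        lines.append(f"## Page {page}")
        lines.extend(f"- {text}" for text in by_page[page])
    return "\n".join(lines) + "\n"
```

Each page becomes a `## Page N` heading followed by one bullet per highlight, matching the example output.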

Workflow

1. Open your PDF in any supported PDF reader
2. Highlight the important sections using the highlight tool
3. Save the PDF
4. Run PDF2LLM on the file (add --polish to fix OCR errors)
5. Upload the generated .md file to NotebookLM or your preferred LLM

License

MIT License - See LICENSE for details.
