PDF2LLM

Extract highlighted text from PDF files and convert it to clean Markdown, ready to use with LLM tools such as NotebookLM.

Features

  • 🎯 Single File Processing - Extract highlights from a single PDF
  • πŸ“ Batch Processing - Process all PDFs in a folder
  • 🎨 Universal Compatibility - Supports annotations from mainstream PDF readers:
    • Adobe Acrobat Reader
    • Foxit PDF Editor/Reader
    • PDF-XChange Editor
    • Nitro PDF
    • macOS Preview
    • And other PDF readers following the PDF annotation standard
  • πŸ“ Clean Markdown Output - LLM-friendly format with page grouping
  • πŸ–±οΈ Easy to Use - Drag-and-drop batch files for Windows
  • ✨ LLM Polish - Fix OCR errors using AI (Google Gemini, OpenAI, Anthropic)

Installation

  1. Make sure Python 3.10+ is installed
  2. Install dependencies:
pip install -r requirements.txt

Usage

Command Line

# Single file
python -m pdf2llm input.pdf

# Single file with custom output directory
python -m pdf2llm input.pdf -o ./output/

# Batch process a folder
python -m pdf2llm ./pdfs/ --batch

# Batch process including subfolders
python -m pdf2llm ./pdfs/ --batch --recursive
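
If you want to drive these commands from a script, the following is a minimal sketch that assembles the argument list from the flags documented above. The helper name `build_pdf2llm_command` is hypothetical and not part of PDF2LLM; it only mirrors the `-o`, `--batch`, and `--recursive` options shown in the examples.

```python
import sys


def build_pdf2llm_command(source, output_dir=None, batch=False, recursive=False):
    """Build an argument list for `python -m pdf2llm` (hypothetical helper).

    Only the flags shown in this README are used. The README shows
    --recursive only together with --batch, so pass both for subfolders.
    """
    cmd = [sys.executable, "-m", "pdf2llm", str(source)]
    if output_dir is not None:
        cmd += ["-o", str(output_dir)]
    if batch:
        cmd.append("--batch")
    if recursive:
        cmd.append("--recursive")
    return cmd


# To actually run it (requires PDF2LLM and its dependencies to be installed):
#   import subprocess
#   subprocess.run(build_pdf2llm_command("./pdfs/", batch=True), check=True)
```

Using `sys.executable` keeps the call inside the same Python environment where the dependencies were installed.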

LLM Polish (Fix OCR Errors)

Use the --polish flag to clean up OCR errors using an LLM:

# Polish with default model (Google Gemini 3 Flash - free tier)
python -m pdf2llm input.pdf --polish

# Polish with a specific model
python -m pdf2llm input.pdf --polish --model gpt-4o-mini
python -m pdf2llm input.pdf --polish --model claude-3-haiku-20240307

When using --polish, two files are created for comparison:

  • input-ocr.md - Raw extracted text (before LLM polish)
  • input-llm.md - Polished text (after LLM polish)

This allows you to verify that no content was lost during polishing.
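
One way to do that check is to compare the highlight bullets in the two files. The sketch below assumes both files use the `- ` bullet format shown in the Output Example; the function name `compare_polish` is illustrative and not part of PDF2LLM.

```python
import difflib


def compare_polish(ocr_text, llm_text):
    """Compare highlight bullets before and after polishing.

    Returns (ocr_count, llm_count, changes), where changes lists every
    non-identical region as (tag, ocr_lines, llm_lines).
    Assumes each highlight is a line starting with "- ".
    """
    ocr = [line for line in ocr_text.splitlines() if line.startswith("- ")]
    llm = [line for line in llm_text.splitlines() if line.startswith("- ")]
    matcher = difflib.SequenceMatcher(a=ocr, b=llm, autojunk=False)
    changes = [
        (tag, ocr[i1:i2], llm[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
    return len(ocr), len(llm), changes


# Typical use, reading the two files produced by --polish:
#   from pathlib import Path
#   n_ocr, n_llm, changes = compare_polish(
#       Path("input-ocr.md").read_text(encoding="utf-8"),
#       Path("input-llm.md").read_text(encoding="utf-8"),
#   )
```

A mismatch between the two counts, or any `delete` entry in `changes`, suggests a highlight was dropped; `replace` entries are the lines the LLM rewrote and are worth a quick read.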

Setup API Key

  1. Copy .env.example to .env
  2. Add your API key:
# Google Gemini (free tier available)
# Get your key at: https://aistudio.google.com/apikey
GOOGLE_API_KEY=your_key_here
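
Before running with --polish, it can help to confirm that the placeholder was actually replaced. This is a rough sketch of such a check, not how PDF2LLM itself reads the file; `parse_env` and `has_google_key` are hypothetical helpers that handle only simple `KEY=value` lines.

```python
def parse_env(text):
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("\"'")
    return values


def has_google_key(text):
    """True if GOOGLE_API_KEY is set to something other than the placeholder."""
    key = parse_env(text).get("GOOGLE_API_KEY", "")
    return bool(key) and key != "your_key_here"


# Typical use:
#   from pathlib import Path
#   print(has_google_key(Path(".env").read_text(encoding="utf-8")))
```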

Supported Models

| Provider | Models | Free Tier |
|---|---|---|
| Google | gemini-3-flash-preview, gemini-2.5-flash, gemini-2.0-flash | ✅ Yes |
| OpenAI | gpt-4o, gpt-4o-mini, gpt-3.5-turbo | ❌ No |
| Anthropic | claude-3-5-sonnet, claude-3-haiku | ❌ No |
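
The model names in the table follow a per-provider prefix, so the provider for a given --model value can be guessed from the name alone. The sketch below illustrates that idea; the prefix mapping is inferred from the table, and `provider_for` is a hypothetical helper, not necessarily how PDF2LLM selects a provider.

```python
# Prefixes inferred from the Supported Models table (an assumption).
PROVIDER_PREFIXES = {
    "gemini-": "Google",
    "gpt-": "OpenAI",
    "claude-": "Anthropic",
}


def provider_for(model):
    """Return the provider whose prefix matches the model name."""
    for prefix, provider in PROVIDER_PREFIXES.items():
        if model.startswith(prefix):
            return provider
    raise ValueError(f"Unrecognised model name: {model}")
```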

Drag and Drop (Windows)

  1. Single PDF: Drag a PDF file onto extract.bat
  2. Folder: Drag a folder onto extract_folder.bat

The Markdown files will be created in the same location as the source PDFs.

Output Example

Given a PDF with highlights, the output will look like:

# Biology_Chapter_5

## Page 3
- Photosynthesis is the process by which plants convert sunlight into energy.
- Chlorophyll is the green pigment responsible for absorbing light.

## Page 7
- The mitochondria is the powerhouse of the cell.
- ATP is the primary energy currency of cells.
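
Because the output is regular, it is easy to load back into a program. The following is a minimal sketch that parses the format shown above into a title and a page-to-highlights mapping. It assumes exactly this layout (`# title`, `## Page N` with a numeric page, `- ` bullets); `parse_highlights` is an illustrative name, not part of PDF2LLM.

```python
def parse_highlights(markdown):
    """Parse PDF2LLM-style Markdown into (title, {page_number: [highlights]})."""
    title = None
    pages = {}
    current = None
    for raw in markdown.splitlines():
        line = raw.strip()
        if line.startswith("## Page "):
            # Assumes the page label is a plain integer.
            current = int(line[len("## Page "):])
            pages.setdefault(current, [])
        elif line.startswith("# ") and title is None:
            title = line[2:].strip()
        elif line.startswith("- ") and current is not None:
            pages[current].append(line[2:].strip())
    return title, pages
```

Applied to the example above, this yields the title `Biology_Chapter_5` and two highlights each for pages 3 and 7.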

Workflow

  1. Open your PDF in any supported PDF reader
  2. Highlight the important sections using the highlight tool
  3. Save the PDF
  4. Run PDF2LLM on the file (add --polish to fix OCR errors)
  5. Upload the generated .md file to NotebookLM or your preferred LLM

License

MIT License - See LICENSE for details.