pdf-annotation-reader

Stop squinting at red underlines. One command turns a "reviewed" PDF into a clean checklist of what the reviewer actually wrote.

Python 3.9+ | License: MIT | Claude Code Skill


The problem (you've felt this)

Your advisor, lawyer, or editor sends back a PDF covered in red marks. You open it and see... only the marks. The actual comments ("change this to that", "rewrite this paragraph", "the equation is wrong") are hidden inside annotation popups that never get printed and rarely get exported.

So you either:

  1. 🔁 Click every single mark to read each tooltip (and lose your place)
  2. 📞 Email back asking "can you send the comments as text?"
  3. 🤞 Guess what they meant from the highlight color alone

There's a better way. Reviewer comments live in the PDF as structured Annot objects. We just have to read them.

What this does

$ python extract_annotations.py thesis-reviewed.pdf

===== Page 1 =====
  [1] Underline by 23727
      underlined: 'Lianyungang'
      ★ comment : '后面加City'

===== Page 3 =====
  [1] Underline by 23727
      underlined: '构建'
      ★ comment : '改成使用'

===== Page 24 =====
  [1] Underline by 23727
      underlined: 'x(0)(k) + az(1)(k) = b'
      ★ comment : '微分方程写的不对'

Total: 23 annotations

Now you have an actionable list. Fix → rebuild → re-run to confirm zero remain.

Install

git clone https://github.com/TsekaLuk/pdf-annotation-reader.git
cd pdf-annotation-reader
pip install -r requirements.txt   # just pymupdf

Or skip the clone and fetch only the script:

pip install pymupdf
curl -O https://raw.githubusercontent.com/TsekaLuk/pdf-annotation-reader/main/extract_annotations.py
python extract_annotations.py your.pdf

Usage

# Default: terminal-friendly text
python extract_annotations.py reviewed.pdf

# Markdown table — paste into a report, Notion, or PR description
python extract_annotations.py reviewed.pdf --markdown > review.md

# JSON for tooling (LLM input, ticket creation, diff generation)
python extract_annotations.py reviewed.pdf --json > review.json
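
For the tooling case, here is a minimal sketch of turning the JSON into a Markdown task list. The README does not show the JSON schema, so the sketch assumes a list of objects whose keys (`page`, `underlined`, `comment`) mirror the Markdown columns; check the real output and adjust the key names. `to_checklist` is a hypothetical helper, not part of the script.

```python
from typing import Dict, List


def to_checklist(records: List[Dict[str, object]]) -> str:
    """Render annotation records as a Markdown task list, ordered by page.

    ASSUMPTION: each record has "page", "underlined" and "comment" keys.
    The real --json output may use different names.
    """
    lines = []
    for rec in sorted(records, key=lambda r: r.get("page", 0)):
        # Fall back to placeholders so ink or comment-less marks still appear.
        target = rec.get("underlined") or "(no marked text)"
        comment = rec.get("comment") or "(no comment)"
        lines.append(f"- [ ] p.{rec.get('page', '?')}: {target!r} -> {comment}")
    return "\n".join(lines)
```

To use it, load `review.json` with `json.load` and pass the resulting list to `to_checklist`.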

Markdown output looks like this

| Page | Type | Underlined | Comment | Author |
| --- | --- | --- | --- | --- |
| 1 | Underline | Lianyungang | 后面加City | 23727 |
| 3 | Underline | 构建 | 改成使用 | 23727 |
| 24 | Underline | x⁽⁰⁾(k) + az⁽¹⁾(k) = b | 微分方程写的不对 | 23727 |

Why not just OCR / use a vision LLM?

| Approach | Speed | Cost | Reliability |
| --- | --- | --- | --- |
| Vision LLM ("what's in this PDF?") | 🐢 slow | 💸 per-token | 🎲 hallucinates underlines that don't exist |
| OCR + heuristics | 🐢 slow | 💵 | 🎲 misses popup comments entirely |
| This tool (PyMuPDF annotation API) | ⚡️ instant | 🆓 free | 🎯 reads the source metadata directly |

The popup text is already stored in the PDF as a Unicode text string on each annotation (its Contents entry), so reading it is a lookup. There's no inference involved.
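
As a rough illustration of that direct read, the sketch below walks a PDF's annotations with PyMuPDF and yields the popup text and author for each. It is not the code in `extract_annotations.py`. The function names and record keys are assumptions made for this example; only `page.annots()`, `annot.type`, `annot.info`, and `annot.rect` come from PyMuPDF.

```python
from typing import Dict, Iterable, Iterator, List


def iter_annotations(pdf_path: str) -> Iterator[Dict[str, object]]:
    """Yield one record per annotation found in the PDF at pdf_path."""
    import fitz  # PyMuPDF; imported lazily so the helper below works without it

    with fitz.open(pdf_path) as doc:
        for page in doc:
            for annot in page.annots():
                info = annot.info  # dict: "content", "title", "modDate", ...
                yield {
                    "page": page.number + 1,           # page.number is 0-based
                    "type": annot.type[1],             # e.g. "Underline"
                    "author": info.get("title", ""),   # PDF stores author as title
                    "comment": info.get("content", ""),
                    "rect": tuple(annot.rect),
                }


def with_comments(records: Iterable[Dict[str, object]]) -> List[Dict[str, object]]:
    """Keep only records whose comment is non-blank (hypothetical helper)."""
    return [r for r in records if str(r.get("comment", "")).strip()]
```

Calling `with_comments(iter_annotations("reviewed.pdf"))` would leave only the marks that carry a reviewer note.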

Use as a Claude Code Skill

This repo ships a SKILL.md so Claude Code (and any tool that loads SKILL.md frontmatter) can invoke it as a first-class skill.

# in your ~/.claude/skills/ or project skills dir:
git clone https://github.com/TsekaLuk/pdf-annotation-reader.git

Then ask Claude: "What are the reviewer's comments on paper-reviewed.pdf?" — the skill auto-triggers.

Edge cases handled

  • Multi-line underlines (reconstructed from per-line quad points)
  • Author + timestamps preserved (handy for multi-reviewer documents)
  • Mixed annotation types (Underline, Highlight, StrikeOut, Sticky Note, FreeText)
  • Unicode text (Chinese, Japanese, Arabic — anything PyMuPDF can extract)
  • ⚠️ Ink / drawn annotations — no text to extract, only location is reported
  • ⚠️ Flattened PDFs — annotations were merged into the page graphics; nothing to read. Ask the sender to re-export with markup preserved.
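
The multi-line case can be sketched as follows, assuming the usual PyMuPDF layout where a markup annotation's `vertices` is a flat list with four points per marked line. `quads_to_boxes` and `marked_text` are hypothetical names for this example, and the script's own reconstruction may differ.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # x0, y0, x1, y1


def quads_to_boxes(vertices: Sequence[Point]) -> List[Box]:
    """Split a flat vertex list into 4-point quads; return each quad's bounding box."""
    if len(vertices) % 4 != 0:
        raise ValueError("expected 4 vertices per quad")
    boxes = []
    for i in range(0, len(vertices), 4):
        quad = vertices[i:i + 4]
        xs = [p[0] for p in quad]
        ys = [p[1] for p in quad]
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes


def marked_text(page, annot) -> str:
    """Join the text under every quad of a markup annotation.

    `page` and `annot` are PyMuPDF Page and Annot objects. Joining with a
    space is an assumption; scripts without word spacing (e.g. Chinese)
    may read better joined with an empty string.
    """
    vertices = annot.vertices or []  # None for annotation types without vertices
    parts = [page.get_textbox(box).strip() for box in quads_to_boxes(vertices)]
    return " ".join(p for p in parts if p)
```

Extracting per quad instead of using the annotation's overall rectangle avoids picking up unmarked words that sit inside the bounding box of a mark spanning several lines.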

Real-world story (where this came from)

Built in 30 minutes because an advisor returned a thesis as a PDF with 23 red underlines and zero visible comments. Tooltip-hovering 23 times to read the popups was unworkable; this script turned the lot into a markdown checklist in 0.4 seconds. The author then applied every fix in both the LaTeX source and the native Word version, then re-ran the extractor to confirm "zero annotations remain" before resubmitting.

Roadmap

  • PDF round-trip: apply changes and emit "addressed" stamps next to each annotation
  • --diff mode: pair annotations with the current source file and show what already matches
  • HTML report with anchored links back to the PDF
  • PyPI package

PRs welcome.

License

MIT — do whatever you want, attribution appreciated.
