pdf-annotation-reader

Stop squinting at red underlines. One command turns a "reviewed" PDF into a clean checklist of what the reviewer actually wrote.

The problem (you've felt this)

Your advisor / lawyer / editor sends back a PDF covered in red marks. You open it and see... only the marks. The actual comments — "change this to that", "rewrite this paragraph", "the equation is wrong" — are hidden inside annotation popups that never get printed and rarely get exported.

So you either:

🔁 Click every single mark to read each tooltip (and lose your place)
📞 Email back asking "can you send the comments as text?"
🤞 Guess what they meant from the highlight color alone

There's a better way. Reviewer comments live in the PDF as structured Annot objects. We just have to read them.
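
As a rough sketch of what "reading them" can look like with PyMuPDF: `page.annots()`, `annot.type`, and `annot.info` are PyMuPDF calls, while `iter_annotations` and `extract` are hypothetical names chosen for illustration. This is not necessarily how `extract_annotations.py` is organized.

```python
from typing import Any, Dict, Iterator, List


def iter_annotations(doc: Any) -> Iterator[Dict[str, Any]]:
    """Yield one record per annotation in a PyMuPDF document (or a look-alike)."""
    for page_index, page in enumerate(doc):
        # page.annots() yields the page's annotation objects; the "or ()"
        # guards against a None return on pages without annotations.
        for annot in page.annots() or ():
            info = annot.info  # dict with "content", "title", dates, ...
            yield {
                "page": page_index + 1,           # 1-based page number
                "type": annot.type[1],            # e.g. "Underline", "Highlight"
                "author": info.get("title", ""),  # PDF stores the author as "title"
                "comment": info.get("content", ""),
            }


def extract(path: str) -> List[Dict[str, Any]]:
    """Open a PDF from disk and collect every annotation record."""
    import fitz  # PyMuPDF, imported lazily so iter_annotations has no dependency

    with fitz.open(path) as pdf:
        return list(iter_annotations(pdf))
```

Calling `extract("thesis-reviewed.pdf")` would return a list of dicts, one per mark, which can then be printed or serialized however you like.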

What this does

$ python extract_annotations.py thesis-reviewed.pdf

===== Page 1 =====
  [1] Underline by 23727
      underlined: 'Lianyungang'
      ★ comment : '后面加City'

===== Page 3 =====
  [1] Underline by 23727
      underlined: '构建'
      ★ comment : '改成使用'

===== Page 24 =====
  [1] Underline by 23727
      underlined: 'x(0)(k) + az(1)(k) = b'
      ★ comment : '微分方程写的不对'

Total: 23 annotations

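Now you have an actionable list; in this sample the comments say "add City after this", "change this to 使用 (use)", and "the differential equation is written incorrectly". Fix, rebuild, then re-run to confirm that none remain.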

Install

git clone https://github.com/TsekaLuk/pdf-annotation-reader.git
cd pdf-annotation-reader
pip install -r requirements.txt   # just pymupdf

Or, without cloning the repo:

pip install pymupdf
curl -O https://raw.githubusercontent.com/TsekaLuk/pdf-annotation-reader/main/extract_annotations.py
python extract_annotations.py your.pdf

Usage

# Default: terminal-friendly text
python extract_annotations.py reviewed.pdf

# Markdown table — paste into a report, Notion, or PR description
python extract_annotations.py reviewed.pdf --markdown > review.md

# JSON for tooling (LLM input, ticket creation, diff generation)
python extract_annotations.py reviewed.pdf --json > review.json
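
For downstream tooling, the JSON output could be consumed along these lines. This is only a sketch: it assumes `review.json` holds a list of objects with the keys `page`, `underlined`, `comment`, and `author` (mirroring the Markdown columns), which may not match the script's actual schema. `load_records` and `checklist_by_author` are hypothetical helpers.

```python
import json
from collections import defaultdict
from typing import Dict, List


def load_records(path: str) -> List[dict]:
    """Read the JSON file written by the --json mode (schema assumed, see above)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def checklist_by_author(records: List[dict]) -> Dict[str, List[str]]:
    """Group records into per-reviewer to-do lines, ordered by page."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for rec in sorted(records, key=lambda r: r.get("page", 0)):
        line = "- [ ] p.{page}: '{underlined}' -> {comment}".format(
            page=rec.get("page", "?"),
            underlined=rec.get("underlined", ""),
            comment=rec.get("comment", ""),
        )
        grouped[rec.get("author", "unknown")].append(line)
    return dict(grouped)
```

Grouping by author is mainly useful when several reviewers marked up the same file; with a single reviewer the result is one list.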

Markdown output looks like this

| Page | Type | Underlined | Comment | Author |
| --- | --- | --- | --- | --- |
| 1 | Underline | `Lianyungang` | 后面加City | 23727 |
| 3 | Underline | `构建` | 改成使用 | 23727 |
| 24 | Underline | `x⁽⁰⁾(k) + az⁽¹⁾(k) = b` | 微分方程写的不对 | 23727 |
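
A table like this can be produced from annotation records with a few lines of string handling. The sketch below assumes each record is a dict with `page`, `type`, `underlined`, `comment`, and `author` keys; those names and the `to_markdown` helper are illustrative, not taken from the script.

```python
from typing import Iterable, Mapping

COLUMNS = ("Page", "Type", "Underlined", "Comment", "Author")


def _cell(value: object) -> str:
    # Escape pipes and flatten newlines so each record stays on one table row.
    return str(value).replace("|", "\\|").replace("\n", " ").strip()


def to_markdown(records: Iterable[Mapping[str, object]]) -> str:
    """Render annotation records as a Markdown pipe table."""
    lines = [
        "| " + " | ".join(COLUMNS) + " |",
        "| " + " | ".join("---" for _ in COLUMNS) + " |",
    ]
    for rec in records:
        cells = [
            _cell(rec.get("page", "")),
            _cell(rec.get("type", "")),
            "`" + _cell(rec.get("underlined", "")) + "`",  # marked text as code
            _cell(rec.get("comment", "")),
            _cell(rec.get("author", "")),
        ]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)
```

Escaping `|` matters because reviewer comments are free text; an unescaped pipe would split a comment across columns.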

Why not just OCR / use a vision LLM?

| Approach | Speed | Cost | Reliability |
| --- | --- | --- | --- |
| Vision LLM ("what's in this PDF?") | 🐢 slow | 💸 per-token | 🎲 can hallucinate underlines that don't exist |
| OCR + heuristics | 🐢 slow | 💵 | 🎲 misses popup comments entirely |
| This tool (`PyMuPDF` annotation API) | ⚡️ instant | 🆓 free | 🎯 reads the source metadata directly |

The popup text is already stored in the PDF as a Unicode text string on each annotation (its /Contents entry), so the tool reads it directly. There's no inference involved.

Use as a Claude Code Skill

This repo ships a SKILL.md so Claude Code (and any tool that loads SKILL.md frontmatter) can invoke it as a first-class skill.

# in your ~/.claude/skills/ or project skills dir:
git clone https://github.com/TsekaLuk/pdf-annotation-reader.git

Then ask Claude: "What are the reviewer's comments on paper-reviewed.pdf?" — the skill auto-triggers.

Edge cases handled

✅ Multi-line underlines (reconstructed from per-line quad points)
✅ Author + timestamps preserved (handy for multi-reviewer documents)
✅ Mixed annotation types (Underline, Highlight, StrikeOut, Sticky Note, FreeText)
✅ Unicode text (Chinese, Japanese, Arabic — anything PyMuPDF can extract)
⚠️ Ink / drawn annotations — no text to extract, only location is reported
⚠️ Flattened PDFs — annotations were merged into the page graphics; nothing to read. Ask the sender to re-export with markup preserved.
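
The multi-line case can be sketched as follows. A text-markup annotation carries four vertices per marked line, exposed by PyMuPDF as `annot.vertices`; each group of four is turned into a bounding box and the page text is clipped to it with `page.get_text("text", clip=...)`. The helper names `quad_boxes` and `marked_text` are hypothetical, and the real script may reconstruct the text differently.

```python
from typing import Any, List, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]


def quad_boxes(vertices: Sequence[Point]) -> List[Box]:
    """Split a flat vertex list into one (x0, y0, x1, y1) box per 4-point quad."""
    boxes: List[Box] = []
    for i in range(0, len(vertices) - 3, 4):
        quad = vertices[i:i + 4]
        xs = [p[0] for p in quad]
        ys = [p[1] for p in quad]
        # min/max makes the result independent of the vertex order in the quad
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes


def marked_text(page: Any, annot: Any) -> str:
    """Join the page text found under each quad of a markup annotation."""
    vertices = annot.vertices or []  # may be None for e.g. sticky notes
    pieces = [
        page.get_text("text", clip=box).strip()  # clip takes a rect-like value
        for box in quad_boxes(vertices)
    ]
    return " ".join(piece for piece in pieces if piece)
```

Joining lines with a space suits Latin-script text; for Chinese or Japanese an empty separator is probably the better choice. Text from neighbouring lines may also be picked up when quads sit very close together, so some tolerance tuning is likely needed in practice.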

Real-world story (where this came from)

Built in 30 minutes because an advisor returned a thesis as a PDF with 23 red underlines and zero visible comments. Tooltip-hovering 23 times to read the popups was unworkable; this script turned the lot into a markdown checklist in 0.4 seconds. The author then applied every fix in both the LaTeX source and the native Word version, then re-ran the extractor to confirm "zero annotations remain" before resubmitting.

Roadmap

PDF round-trip: apply changes and emit "addressed" stamps next to each annotation
--diff mode: pair annotations with the current source file and show what already matches
HTML report with anchored links back to the PDF
PyPI package

PRs welcome.

License

MIT — do whatever you want, attribution appreciated.
