Stop squinting at red underlines. One command turns a "reviewed" PDF into a clean checklist of what the reviewer actually wrote.
Your advisor / lawyer / editor sends back a PDF covered in red marks. You open it and see... only the marks. The actual comments — "change this to that", "rewrite this paragraph", "the equation is wrong" — are hidden inside annotation popups that never get printed and rarely get exported.
So you either:
- 🔁 Click every single mark to read each tooltip (and lose your place)
- 📞 Email back asking "can you send the comments as text?"
- 🤞 Guess what they meant from the highlight color alone
There's a better way. Reviewer comments live in the PDF as structured Annot objects. We just have to read them.
$ python extract_annotations.py thesis-reviewed.pdf
===== Page 1 =====
[1] Underline by 23727
underlined: 'Lianyungang'
★ comment : '后面加City'
===== Page 3 =====
[1] Underline by 23727
underlined: '构建'
★ comment : '改成使用'
===== Page 24 =====
[1] Underline by 23727
underlined: 'x(0)(k) + az(1)(k) = b'
★ comment : '微分方程写的不对'
Total: 23 annotations

Now you have an actionable list. Fix → rebuild → re-run to confirm zero remain.
git clone https://github.com/TsekaLuk/pdf-annotation-reader.git
cd pdf-annotation-reader
pip install -r requirements.txt  # just pymupdf

Or, without cloning:
pip install pymupdf
curl -O https://raw.githubusercontent.com/TsekaLuk/pdf-annotation-reader/main/extract_annotations.py
python extract_annotations.py your.pdf

# Default: terminal-friendly text
python extract_annotations.py reviewed.pdf
# Markdown table — paste into a report, Notion, or PR description
python extract_annotations.py reviewed.pdf --markdown > review.md
# JSON for tooling (LLM input, ticket creation, diff generation)
python extract_annotations.py reviewed.pdf --json > review.json

The --markdown output renders as:

| Page | Type | Underlined | Comment | Author |
|---|---|---|---|---|
| 1 | Underline | Lianyungang | 后面加City | 23727 |
| 3 | Underline | 构建 | 改成使用 | 23727 |
| 24 | Underline | x⁽⁰⁾(k) + az⁽¹⁾(k) = b | 微分方程写的不对 | 23727 |

| Approach | Speed | Cost | Reliability |
|---|---|---|---|
| Vision LLM ("what's in this PDF?") | 🐢 slow | 💸 per-token | 🎲 hallucinates underlines that don't exist |
| OCR + heuristics | 🐢 slow | 💵 | 🎲 misses popup comments entirely |
| This tool (PyMuPDF annotation API) | ⚡️ instant | 🆓 free | 🎯 reads the source metadata directly |
The annotation popup text is already stored in the PDF as plain Unicode. There's no inference involved.
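To make "no inference involved" concrete, here is a minimal sketch of reading those fields through PyMuPDF's annotation API. It is not the repo's extract_annotations.py: the function names `annotation_record` and `read_annotations` are invented for this example, and it assumes PyMuPDF is installed and importable as `fitz`.

```python
def annotation_record(page_number, type_name, info):
    """Flatten one annotation's metadata into a plain dict.

    `info` is the dict PyMuPDF exposes as Annot.info. The reviewer's name
    is stored under the "title" key and the popup text under "content".
    """
    return {
        "page": page_number,
        "type": type_name,
        "author": info.get("title", ""),
        "comment": info.get("content", ""),
    }


def read_annotations(pdf_path):
    """Return one record per annotation in the PDF, in page order."""
    # Imported here so annotation_record() stays usable without PyMuPDF.
    import fitz

    records = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            for annot in page.annots():
                # annot.type is a tuple such as (9, "Underline");
                # page.number is zero-based, so add 1 for display.
                records.append(
                    annotation_record(page.number + 1, annot.type[1], annot.info)
                )
    return records
```

Calling `read_annotations("reviewed.pdf")` should return a list of dicts, one per mark, with the comment text taken straight from the file's metadata.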
This repo ships a SKILL.md so Claude Code (and any tool that loads SKILL.md frontmatter) can invoke it as a first-class skill.
# in your ~/.claude/skills/ or project skills dir:
git clone https://github.com/TsekaLuk/pdf-annotation-reader.git

Then ask Claude: "What are the reviewer's comments on paper-reviewed.pdf?" The skill triggers automatically.
- ✅ Multi-line underlines (reconstructed from per-line quad points)
- ✅ Author + timestamps preserved (handy for multi-reviewer documents)
- ✅ Mixed annotation types (Underline, Highlight, StrikeOut, Sticky Note, FreeText)
- ✅ Unicode text (Chinese, Japanese, Arabic — anything PyMuPDF can extract)
- ⚠️ Ink / drawn annotations: no text to extract, only the location is reported
- ⚠️ Flattened PDFs: annotations were merged into the page graphics; nothing to read. Ask the sender to re-export with markup preserved.
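The multi-line underline case above can be sketched roughly as follows. This is one plausible approach, not necessarily how the script does it: a markup annotation's `vertices` list holds four points per underlined line, so each group of four becomes a bounding box, and the text inside each box is read with PyMuPDF's `Page.get_textbox`. The helper names `quad_rects` and `marked_text` are hypothetical.

```python
def quad_rects(vertices):
    """Group a flat vertex list into one bounding box per quad.

    Text-markup annotations store 4 (x, y) points per marked line, so a
    two-line underline yields 8 points and therefore 2 boxes.
    """
    rects = []
    for i in range(0, len(vertices) - 3, 4):
        quad = vertices[i:i + 4]
        xs = [point[0] for point in quad]
        ys = [point[1] for point in quad]
        rects.append((min(xs), min(ys), max(xs), max(ys)))
    return rects


def marked_text(page, annot):
    """Join the page text found under each quad of a markup annotation."""
    # annot.vertices is None for annotation types without quad points.
    rects = quad_rects(annot.vertices or [])
    parts = [page.get_textbox(rect).strip() for rect in rects]
    return " ".join(part for part in parts if part)
```

Joining with a space suits Latin text; for scripts written without spaces, such as Chinese, an empty separator is probably the better choice.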
Built in 30 minutes because an advisor returned a thesis as a PDF with 23 red underlines and zero visible comments. Tooltip-hovering 23 times to read the popups was unworkable; this script turned the lot into a markdown checklist in 0.4 seconds. The author then applied every fix in both the LaTeX source and the native Word version, then re-ran the extractor to confirm "zero annotations remain" before resubmitting.
- PDF round-trip: apply changes and emit "addressed" stamps next to each annotation
- --diff mode: pair annotations with the current source file and show what already matches
- HTML report with anchored links back to the PDF
- PyPI package
PRs welcome.
MIT — do whatever you want, attribution appreciated.