Advanced multi-format document extraction system with self-correcting AI agents, annotation-guided learning, and continuous improvement through metadata accumulation. Handles PDFs, DOCX, PPTX, XML, HTML, and more with enterprise-grade accuracy.
```bash
# Extract any document (auto-detects format and preset)
python -m extractor.pipeline paper.pdf --out results/

# Fast mode (no LLM, quick extraction)
python -m extractor.pipeline paper.pdf --mode fast

# Accurate mode (full LLM pipeline)
python -m extractor.pipeline paper.pdf --use-llm

# Force specific preset (skip auto-detection)
python -m extractor.pipeline paper.pdf --preset arxiv
```

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                               INPUT DOCUMENT                                │
│                   (PDF, HTML, DOCX, XML, EPUB, PPTX, etc.)                  │
└────────────────────────────────┬────────────────────────────────────────────┘
                                 │
┌────────────────────────────────▼────────────────────────────────────────────┐
│ STAGE 00: PROFILE DETECTOR                                                  │
│  • Analyze: domain, layout, tables, formulas, requirements                  │
│  • Match preset: arxiv, requirements_spec, auto                             │
│  • Determine route: fast vs accurate                                        │
└────────────────────────────────┬────────────────────────────────────────────┘
                                 │
┌────────────────────────────────▼────────────────────────────────────────────┐
│ PDF EXTRACTION (STAGES 01-06)                                               │
│                                                                             │
│  s01  annotation_processor │ Strip/process PDF annotations                  │
│  s02  marker_extractor     │ Extract blocks via Marker/pymupdf4llm          │
│  s03  suspicious_headers   │ VLM verification of headers                    │
│  s04  section_builder      │ Build hierarchical sections                    │
│  s04a layout_audit         │ Audit layout quality                           │
│  s05  table_extractor      │ Extract tables (Camelot)                       │
│  s05b table_describer      │ VLM descriptions for tables                    │
│  s05c table_merger         │ Merge & deduplicate tables                     │
│  s06  figure_extractor     │ Extract figures/images                         │
│  s06b figure_describer     │ VLM descriptions for figures                   │
└────────────────────────────────┬────────────────────────────────────────────┘
                                 │
┌────────────────────────────────▼────────────────────────────────────────────┐
│ COMMON STAGES (07-14)                                                       │
│                                                                             │
│  s07  json_assembler        │ Assemble sections + tables + figures → JSON   │
│  s08  extract_requirements  │ Mine requirements (REQ-xxx, SHALL, MUST)      │
│  s08  lean4_theorem_prover  │ Formal proofs (scientific papers)             │
│  s09  section_summarizer    │ LLM summaries per section                     │
│  s10  markdown_exporter     │ Export to Markdown                            │
│  s10  arangodb_exporter     │ Sync to ArangoDB graph                        │
│  s14  report_generator      │ Generate extraction report                    │
└────────────────────────────────┬────────────────────────────────────────────┘
                                 │
┌────────────────────────────────▼────────────────────────────────────────────┐
│ PIPELINE OUTPUTS                                                            │
│                                                                             │
│  pipeline.duckdb       │ Queryable structured data                          │
│  04_sections.json      │ Hierarchical sections                              │
│  05_tables.json        │ Extracted tables with descriptions                 │
│  06_figures.json       │ Extracted figures with descriptions                │
│  08_requirements.json  │ Mined requirements                                 │
│  document.md           │ Human-readable Markdown                            │
│  ArangoDB Graph        │ documents → sections → requirements                │
└─────────────────────────────────────────────────────────────────────────────┘
```
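The requirements-mining stage (s08) can be approximated offline. Below is a minimal sketch, assuming plain section text as input: the REQ-xxx ID pattern and the SHALL/MUST keywords come from the stage description above, but the output record shape is illustrative, not the pipeline's actual 08_requirements.json schema.

```python
import re

# Signals named in the s08 stage description; the output dict
# shape is a hypothetical stand-in for the real schema.
REQ_ID = re.compile(r"\bREQ-\d+\b")
MODALS = re.compile(r"\b(SHALL|MUST)\b", re.IGNORECASE)

def mine_requirements(text: str) -> list[dict]:
    """Return one record per sentence carrying a requirement signal."""
    requirements = []
    for sentence in re.split(r"(?<=[.;])\s+", text):
        ids = REQ_ID.findall(sentence)
        if ids or MODALS.search(sentence):
            requirements.append({"ids": ids, "text": sentence.strip()})
    return requirements

reqs = mine_requirements(
    "REQ-101: The system shall log all access. "
    "Operators MUST rotate keys monthly. Background info follows."
)
```

Sentences with neither an ID nor a modal keyword (the last one above) are dropped, which is the cheap filter a fast, non-LLM pass can apply before any deeper analysis.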
Extractor focuses on structure extraction. For knowledge generation (Q&A pairs), use the downstream skills:

```
EXTRACTOR (this project)      QRA SKILL (separate)      MEMORY SKILL (separate)
────────────────────────      ────────────────────      ───────────────────────
• Extract sections            • Generate Q&A pairs      • Store with embeddings
• Extract tables              • Validate grounding      • Search & recall
• Extract figures             • Domain-focused          • Learn patterns
• Extract requirements        • LLM-intensive
• Export to DB/MD

Output: Structured JSON       Output: Q&A pairs         Output: Searchable memory
Cost: Low (mostly offline)    Cost: High (LLM tokens)   Cost: Embedding
```

Orchestration: The /distill skill combines all three:

```bash
# One command: Extract → QRA → Memory
.pi/skills/distill/run.sh --file paper.pdf --scope research
```

Cross-format parity measured against the HTML reference:
| Format | Method | Parity | Notes |
|---|---|---|---|
| Markdown | Direct parse | 100% | Perfect structural match |
| DOCX | Native XML (python-docx) | 100% | Perfect structural match |
| HTML | BeautifulSoup | Reference | Baseline for comparison |
| XML | defusedxml | 90% | Structure preserved |
| PDF | 14-stage pipeline | 87% | Varies by document complexity |
| RST | docutils | 85% | Section structure varies |
| EPUB | ebooklib | 82% | Chapter structure varies |
| PPTX | python-pptx | 81% | Slide-based structure |
| XLSX | openpyxl | 16% | Expected (spreadsheet format) |
| Images | OCR/VLM | 16% | Requires VLM for text extraction |
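A parity score here is a structural-similarity measure against the HTML baseline. One way to approximate such a score, as a sketch assuming parity is computed over the ordered sequence of section headings (the project's real metric may also weight tables and figures):

```python
from difflib import SequenceMatcher

def heading_parity(reference: list[str], candidate: list[str]) -> float:
    """Percent similarity of the candidate heading sequence vs. the reference."""
    matcher = SequenceMatcher(a=reference, b=candidate)
    return round(100 * matcher.ratio(), 1)

# Hypothetical heading sequences from two extractions of the same paper
html_headings = ["Abstract", "Introduction", "Methods", "Results", "References"]
pdf_headings = ["Abstract", "Introduction", "Methods", "References"]
score = heading_parity(html_headings, pdf_headings)
```

`SequenceMatcher.ratio()` rewards matched runs in order, so a dropped or reordered section lowers parity more than a renamed one raised to the same level would.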
s10_arangodb_exporter creates:

```
NODES (Vertices):
  documents/doc_{hash}     ← Root document node
  sections/sec_{hash}      ← Section nodes (hierarchical)
  requirements/req_{id}    ← Requirement nodes

EDGES:
  has_section:     Document → Section
  has_section:     Section → SubSection (hierarchy)
  has_requirement: Section → Requirement
```
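With that graph in place, all requirements under a document can be pulled in one traversal. A hedged AQL sketch, assuming the collection and edge names above and an illustrative `doc_abc123` key and `title`/`text` attributes:

```aql
// Walk documents → sections → requirements; depth 1..5 covers nested sections
FOR doc IN documents
  FILTER doc._key == "doc_abc123"   // hypothetical document hash
  FOR v, e, p IN 1..5 OUTBOUND doc has_section, has_requirement
    FILTER IS_SAME_COLLECTION("requirements", v)
    RETURN { section: p.vertices[-2].title, requirement: v.text }
```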
| Preset | Detected When | Features |
|---|---|---|
| arxiv | Academic papers (2-column, math, "Abstract/References") | Full LLM pipeline, Lean4 proving |
| requirements_spec | Engineering specs (REQ-xxx, "Shall", nested sections) | Requirements mining enabled |
| auto | Unknown documents | Heuristic-based routing |
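The auto-detection in stage 00 can be pictured as keyword checks over the opening pages. A minimal sketch keyed on the same signals the table lists; the real detector also inspects layout, tables, and formulas:

```python
import re

def detect_preset(text: str) -> str:
    """Pick a preset from the textual signals in the preset table."""
    sample = text[:5000]  # stage 00 only needs the opening pages
    # Requirements specs: REQ-xxx IDs or modal "shall" language
    if re.search(r"\bREQ-\d+\b", sample) or re.search(r"\bshall\b", sample, re.IGNORECASE):
        return "requirements_spec"
    # Academic papers: Abstract/References framing
    if re.search(r"\bAbstract\b", sample) and re.search(r"\bReferences\b", sample):
        return "arxiv"
    return "auto"

preset = detect_preset("Abstract\nWe study...\nReferences\n[1] ...")
```

Ordering matters in this sketch: spec signals are checked first because "shall" clauses are a stronger commitment signal than the Abstract/References framing, which also appears in some specs.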
```bash
# Basic extraction
python -m extractor.pipeline <pdf> --out <dir>

# Key flags
--preset <name>            # Force preset (arxiv, requirements_spec, auto)
--use-llm                  # Enable LLM for improved accuracy
--offline-smoke            # Deterministic mode, no network calls
--skip-export              # Don't write to ArangoDB
--extract-requirements     # Mine requirements (s08)
--annotate-pdf             # Generate annotated PDF with overlays
--auto-ocr/--no-auto-ocr   # Control scanned PDF handling
--skip-scanned             # Skip detected scanned PDFs
--ocr-lang <code(s)>       # Set OCR language (default: eng)

# Batch processing
python -m extractor.pipeline ./documents/ --out ./results --glob "**/*.pdf"
```

```bash
# Required for LLM stages
CHUTES_API_BASE=https://llm.chutes.ai/v1
CHUTES_API_KEY=<your-key>
CHUTES_VLM_MODEL=Qwen/Qwen3-VL-235B-A22B-Instruct
CHUTES_TEXT_MODEL=moonshotai/Kimi-K2-Instruct-0905

# Required for ArangoDB export
ARANGO_HOST=http://localhost:8529
ARANGO_DB=extractor_graph
ARANGO_USER=root
ARANGO_PASSWORD=<password>

# Optional for Lean4 proving
SCILLM_API_BASE=http://localhost:8787/v1
```

```
src/extractor/
├── core/
│   ├── presets.py                # Preset registry (arxiv, requirements_spec)
│   ├── providers/                # Format-specific extractors
│   │   ├── docx.py
│   │   ├── html.py
│   │   ├── epub.py
│   │   └── ...
│   └── schema/
│       └── unified_document.py
├── pipeline/
│   ├── run_pipeline.py           # Main orchestrator
│   └── steps/
│       ├── s00_profile_detector.py
│       ├── s01_annotation_processor.py
│       ├── s02_marker_extractor.py
│       └── ... (14+ stages)
└── tools/
    └── tasks_loop/
        ├── sanity/               # API sanity scripts
        └── tasks/                # Task files
```
| Skill | Purpose |
|---|---|
| /extractor | User-facing interface to this pipeline |
| /distill | Extract → QRA → Memory (orchestrator) |
| /qra | Generate Q&A pairs from text |
| /doc-to-qra | Document → Memory (simplified) |
| /memory | Store and recall knowledge |
Apache License 2.0 - See LICENSE for details.