Skip to content

Latest commit

 

History

History
664 lines (516 loc) · 18 KB

File metadata and controls

664 lines (516 loc) · 18 KB

Architecture Overview

System design and implementation details for pdf_oxide.

Table of Contents

High-Level Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Public API Layer                          │
│  (PdfDocument, Exporters, Configuration)                        │
└───────────────────┬─────────────────────────────────────────────┘
                    │
┌───────────────────┴─────────────────────────────────────────────┐
│                     Processing Pipeline                          │
│                                                                  │
│  ┌──────────┐   ┌─────────┐   ┌──────────┐   ┌─────────────┐ │
│  │  Lexer   │──▶│ Parser  │──▶│  Stream  │──▶│   Layout    │ │
│  │          │   │         │   │ Decoder  │   │  Analysis   │ │
│  └──────────┘   └─────────┘   └──────────┘   └─────────────┘ │
│                                                       │          │
│  ┌──────────┐   ┌─────────┐   ┌──────────┐          │         │
│  │   Text   │◀──│  Font   │◀──│  Object  │◀─────────┘         │
│  │Extractor │   │ Parser  │   │  Resolver│                     │
│  └──────────┘   └─────────┘   └──────────┘                     │
│        │                                                         │
│        ▼                                                         │
│  ┌──────────┐   ┌─────────┐   ┌──────────┐                    │
│  │ Markdown │   │  HTML   │   │   Text   │                    │
│  │ Exporter │   │Exporter │   │ Exporter │                    │
│  └──────────┘   └─────────┘   └──────────┘                    │
└─────────────────────────────────────────────────────────────────┘
                    │
┌───────────────────┴─────────────────────────────────────────────┐
│                     Storage Layer                                │
│  (Memory-mapped files, Object cache, Font cache)               │
└─────────────────────────────────────────────────────────────────┘

Core Components

1. Lexer (src/lexer.rs)

Purpose: Tokenize PDF byte stream into lexical tokens.

Key Features:

  • Zero-copy tokenization using byte slices
  • Handles all PDF token types (numbers, strings, names, operators)
  • Efficient whitespace and comment skipping
  • Position tracking for error reporting

Performance:

  • ~1-2ms for typical PDFs
  • O(n) complexity where n = file size
  • Minimal allocations (uses &[u8] slices)

Example:

pub struct Lexer<'a> {
    data: &'a [u8],
    position: usize,
}

impl<'a> Lexer<'a> {
    pub fn next_token(&mut self) -> Result<Token<'a>> {
        // Zero-copy token extraction
    }
}

2. Parser (src/parser.rs)

Purpose: Parse PDF objects from token stream.

Key Features:

  • Recursive descent parser
  • Cycle detection for circular references
  • Recursion depth limiting (max 100 levels)
  • Proper error context propagation

Object Types:

  • Null, Boolean, Integer, Real
  • String, Name
  • Array, Dictionary
  • Stream, Reference

Example:

pub struct Parser<'a> {
    lexer: Lexer<'a>,
    objects: HashMap<ObjectId, Object>,
    recursion_depth: u32,
}

impl<'a> Parser<'a> {
    pub fn parse_object(&mut self) -> Result<Object> {
        // Recursive object parsing with cycle detection
    }
}

3. Stream Decoder (src/stream_decoder.rs)

Purpose: Decompress PDF streams using various filters.

Supported Filters:

  • FlateDecode: Zlib/Deflate (most common)
  • LZWDecode: LZW compression
  • ASCII85Decode: ASCII85 encoding
  • ASCIIHexDecode: Hex encoding
  • RunLengthDecode: Run-length encoding
  • DCTDecode: JPEG (passthrough)
  • CCITTFaxDecode: CCITT Group 3/4

Design:

pub trait StreamFilter {
    fn decode(&self, data: &[u8], params: &DecodeParms) -> Result<Vec<u8>>;
}

pub struct StreamDecoder {
    filters: Vec<Box<dyn StreamFilter>>,
}

Performance:

  • Streaming decompression where possible
  • Decompression bomb protection (10× ratio limit)
  • Reusable buffers to minimize allocations

4. Layout Analysis (src/layout/)

Purpose: Understand document spatial structure.

Submodules:

4.1 DBSCAN Clustering (layout/dbscan.rs)

Algorithm: Density-Based Spatial Clustering of Applications with Noise

Parameters:

  • Epsilon (ε): 1.5× median character height
  • MinPts: 2 for chars→words, 3 for words→lines

Data Structure: R*-tree for O(log n) neighborhood queries

Performance: O(n log n) instead of naive O(n²)

pub struct DbscanClusterer {
    spatial_index: RTree<Character>,
    epsilon: f32,
    min_points: usize,
}

impl DbscanClusterer {
    pub fn cluster(&self, chars: &[Character]) -> Vec<Cluster> {
        // DBSCAN with spatial indexing
    }
}

4.2 XY-Cut (layout/xy_cut.rs)

Algorithm: Recursive projection-based page segmentation

Steps:

  1. Project content onto X-axis (horizontal)
  2. Find gaps in projection (valleys)
  3. Split at largest gap
  4. Recurse on sub-regions
  5. Repeat for Y-axis

Configuration:

  • Split threshold: 0.05 × page dimension (5%)
  • Max recursion: 10 levels
  • Gaussian smoothing: σ=2.0
pub struct XyCutSegmenter {
    threshold: f32,
    max_depth: u32,
}

impl XyCutSegmenter {
    pub fn segment(&self, page: &Page) -> Vec<Region> {
        self.segment_recursive(page, 0)
    }

    fn segment_recursive(&self, region: &Region, depth: u32) -> Vec<Region> {
        // Recursive XY-Cut with projection analysis
    }
}

4.3 Font Clustering (layout/font_clustering.rs)

Purpose: Group text by font characteristics for formatting detection.

Features:

  • Size tolerance: ±1pt
  • Family exact matching
  • Outlier rejection: fonts used in <2% of characters

Use Cases:

  • Bold/italic detection
  • Heading identification
  • Style consistency checking

5. Text Extraction (src/text.rs)

Purpose: Extract text with proper spacing and formatting.

Key Algorithms:

  • Word spacing detection: 0.25× character width threshold
  • Line spacing detection: 1.2× character height threshold
  • Bold detection: Font weight analysis
  • Reading order: Top-to-bottom, left-to-right (with column detection)

Quality:

  • 100% word spacing accuracy (tested on 103 PDFs)
  • 100% whitespace preservation
  • 37% more bold sections detected vs. reference
pub struct TextExtractor {
    config: TextConfig,
    font_cache: HashMap<String, Font>,
}

impl TextExtractor {
    pub fn extract(&mut self, page: &Page) -> Result<String> {
        // Character clustering → word formation → line formation → text
    }
}

6. Font Parser (src/font.rs)

Purpose: Parse PDF fonts and handle character encoding.

Features:

  • Standard fonts: Built-in Type1 fonts (Times, Helvetica, etc.)
  • ToUnicode CMap: Unicode mapping for custom encodings
  • Encoding tables: PDFDocEncoding, MacRomanEncoding, WinAnsiEncoding
  • Font metrics: Character widths, ascent, descent

Caching:

  • Parsed fonts cached per document
  • Reduces re-parsing overhead for multi-page documents
  • ~25% speedup on typical documents

7. Image Extraction (src/image.rs)

Purpose: Extract embedded images from PDFs.

Supported Formats:

  • JPEG (DCTDecode)
  • PNG (FlateDecode + predictor)
  • TIFF (CCITTFaxDecode)
  • Raw RGB/CMYK

Features:

  • Color space conversion (CMYK → RGB, etc.)
  • Predictor functions (PNG filters)
  • Decompression
  • Metadata extraction (dimensions, DPI)

8. Exporters (src/converters/)

Purpose: Convert extracted content to various formats.

8.1 Markdown Exporter

Quality: 99.8/100 (tested on 103 PDFs)

Features:

  • Bold text detection and formatting
  • Heading detection (font size heuristics)
  • List detection
  • Link preservation
  • Proper whitespace normalization

8.2 HTML Exporter

Quality: 94.0/100

Features:

  • Semantic HTML5
  • URL/email linkification
  • Bold/italic tags
  • Proper entity encoding
  • CSS-friendly structure

8.3 Plain Text Exporter

Quality: 100.0/100 (whitespace)

Features:

  • Whitespace normalization
  • Word spacing preservation
  • Line break consistency

Data Flow

Opening a Document

1. User calls PdfDocument::open("file.pdf")
   ↓
2. Memory-map file (or load into buffer)
   ↓
3. Parse header, locate cross-reference table
   ↓
4. Build object index (lazy loading)
   ↓
5. Parse catalog and page tree
   ↓
6. Return PdfDocument handle

Extracting Text from a Page

1. User calls doc.extract_text(page_num)
   ↓
2. Resolve page object from page tree
   ↓
3. Get content stream(s)
   ↓
4. Decode streams (decompress)
   ↓
5. Parse content stream operators
   ↓
6. Extract character data with positions
   ↓
7. Cluster characters into words (DBSCAN)
   ↓
8. Cluster words into lines (DBSCAN)
   ↓
9. Detect columns (XY-Cut)
   ↓
10. Sort into reading order
   ↓
11. Format text with proper spacing
   ↓
12. Return extracted text

Exporting to Markdown

1. User calls exporter.export_all(doc)
   ↓
2. For each page:
   ├─ Extract text blocks with metadata
   ├─ Detect bold sections (font analysis)
   ├─ Detect headings (size heuristics)
   ├─ Detect lists (indentation + markers)
   ↓
3. Format as Markdown:
   ├─ Headings: # ## ###
   ├─ Bold: **text**
   ├─ Lists: - item
   ├─ Links: [text](url)
   ↓
4. Concatenate pages with separators
   ↓
5. Return Markdown string

Module Organization

src/
├── lib.rs              # Public API, re-exports
├── error.rs            # Error types
├── object.rs           # PDF object types
├── lexer.rs            # Tokenization
├── parser.rs           # Object parsing
├── document.rs         # High-level API
├── stream_decoder.rs   # Stream decompression
├── font.rs             # Font parsing
├── text.rs             # Text extraction
├── image.rs            # Image extraction
├── form.rs             # Form field extraction
├── bookmark.rs         # Bookmark extraction
├── annotation.rs       # Annotation extraction
│
├── layout/             # Layout analysis
│   ├── mod.rs
│   ├── dbscan.rs       # DBSCAN clustering
│   ├── xy_cut.rs       # XY-Cut segmentation
│   ├── font_clustering.rs
│   └── spatial_index.rs  # R*-tree
│
├── converters/         # Export formats
│   ├── mod.rs
│   ├── markdown.rs     # Markdown exporter
│   ├── html.rs         # HTML exporter
│   └── text.rs         # Plain text exporter
│
└── utils/              # Utilities
    ├── mod.rs
    ├── encoding.rs     # Character encoding
    └── geometry.rs     # Bounding box math

python/                 # Python bindings
├── src/
│   └── lib.rs         # PyO3 bindings
└── pyproject.toml

tests/                  # Integration tests
├── test_parsing.rs
├── test_text_extraction.rs
├── test_exporters.rs
└── fixtures/          # Test PDFs

benches/               # Performance benchmarks
└── parsing.rs

examples/              # Usage examples
├── basic.rs
├── batch_processing.rs
└── custom_exporter.rs

Algorithm Details

DBSCAN Clustering

Pseudo-code:

function dbscan(points, epsilon, minPts):
    clusters = []
    visited = set()

    for point in points:
        if point in visited:
            continue

        visited.add(point)
        neighbors = find_neighbors(point, epsilon)  # O(log n) with R*-tree

        if len(neighbors) < minPts:
            # Noise point
            continue

        # Start new cluster
        cluster = [point]
        queue = neighbors

        while queue:
            neighbor = queue.pop()
            if neighbor not in visited:
                visited.add(neighbor)
                new_neighbors = find_neighbors(neighbor, epsilon)

                if len(new_neighbors) >= minPts:
                    queue.extend(new_neighbors)

            if neighbor not in any cluster:
                cluster.append(neighbor)

        clusters.append(cluster)

    return clusters

Optimization: R*-tree spatial index allows O(log n) neighbor queries instead of O(n) linear scan.

XY-Cut Segmentation

Pseudo-code:

function xy_cut(region, depth):
    if depth > MAX_DEPTH:
        return [region]

    # Project onto X-axis
    x_projection = project_horizontal(region)
    x_gaps = find_gaps(x_projection, threshold)

    if x_gaps:
        split_x = largest_gap(x_gaps)
        left, right = split_region(region, split_x, vertical=True)
        return xy_cut(left, depth+1) + xy_cut(right, depth+1)

    # Project onto Y-axis
    y_projection = project_vertical(region)
    y_gaps = find_gaps(y_projection, threshold)

    if y_gaps:
        split_y = largest_gap(y_gaps)
        top, bottom = split_region(region, split_y, vertical=False)
        return xy_cut(top, depth+1) + xy_cut(bottom, depth+1)

    # No split found
    return [region]

Word Spacing Detection

Algorithm:

For each pair of adjacent characters (c1, c2):
    gap = c2.x - (c1.x + c1.width)
    threshold = 0.25 * c1.width

    if gap > threshold:
        insert_space()

Dynamic threshold: Uses character width for adaptive spacing detection.

Performance Design

Zero-Copy Parsing

Technique: Use byte slices (&[u8]) instead of copying data.

Example:

// Zero-copy token
pub enum Token<'a> {
    Name(&'a [u8]),
    String(&'a [u8]),
    // ...
}

// No allocation needed
let token = lexer.next_token()?;

Memory-Mapped Files

Technique: Use mmap for large files.

Benefits:

  • OS handles memory management
  • Lazy loading (only accessed pages loaded)
  • Shared memory across processes

Object Caching

Strategy: Cache parsed objects by object ID.

Benefits:

  • Avoid re-parsing referenced objects
  • O(1) lookup for cross-references
  • Memory bounded by unique object count

Streaming

Technique: Process pages one at a time.

Benefits:

  • Constant memory regardless of document size
  • Can process 10 GB documents with <100 MB RAM

Extension Points

Custom Exporters

Implement the Exporter trait:

pub trait Exporter {
    fn export_page(&self, doc: &mut PdfDocument, page_num: u32) -> Result<String>;
    fn export_all(&self, doc: &mut PdfDocument) -> Result<String>;
}

Custom Stream Filters

Implement the StreamFilter trait:

pub trait StreamFilter {
    fn decode(&self, data: &[u8], params: &DecodeParms) -> Result<Vec<u8>>;
}

Custom Layout Analyzers

Replace or extend layout analysis:

pub trait LayoutAnalyzer {
    fn analyze(&self, page: &Page) -> Result<Layout>;
}

Security Considerations

Input Validation

PDF as untrusted input:

  • Limit file size (default: 500 MB)
  • Limit object count (default: 1M objects)
  • Limit recursion depth (default: 100 levels)
  • Validate array/dictionary sizes

Decompression Bombs

Protection:

  • Limit decompressed size to 10× compressed size
  • Check compression ratio before full decompression
  • Stream decompression where possible

Memory Safety

Rust guarantees:

  • No buffer overflows
  • No use-after-free
  • No data races
  • Bounds checking on all array accesses

Minimal unsafe:

  • Used sparingly (e.g., performance-critical paths)
  • Thoroughly documented with safety invariants
  • Reviewed by multiple developers

Future Architecture

Planned Improvements

v1.x:

  • Stream API for ultra-low memory usage
  • Table detection (smart heuristics)
  • Rotated text handling

v2.x:

  • ML-based layout analysis (optional)
  • GPU acceleration (optional)
  • Parallel page processing

v3.x:

  • OCR integration (Tesseract)
  • PDF/A validation
  • Advanced form handling

References

  • PDF Specification: ISO 32000-1:2008 (PDF 1.7) - docs/spec/pdf.md
  • Architecture: ARCHITECTURE.md - Detailed system design
  • DBSCAN: Ester et al., "A Density-Based Algorithm for Discovering Clusters" (1996)
  • XY-Cut: Nagy & Seth, "Hierarchical Representation of Optically Scanned Documents" (1984)
  • R-tree*: Beckmann et al., "The R*-tree: An Efficient and Robust Access Method" (1990)

Questions?