System design and implementation details for pdf_oxide.
- High-Level Architecture
- Core Components
- Data Flow
- Module Organization
- Algorithm Details
- Performance Design
- Extension Points
┌─────────────────────────────────────────────────────────────────┐
│ Public API Layer │
│ (PdfDocument, Exporters, Configuration) │
└───────────────────┬─────────────────────────────────────────────┘
│
┌───────────────────┴─────────────────────────────────────────────┐
│ Processing Pipeline │
│ │
│ ┌──────────┐ ┌─────────┐ ┌──────────┐ ┌─────────────┐ │
│ │ Lexer │──▶│ Parser │──▶│ Stream │──▶│ Layout │ │
│ │ │ │ │ │ Decoder │ │ Analysis │ │
│ └──────────┘ └─────────┘ └──────────┘ └─────────────┘ │
│ │ │
│ ┌──────────┐ ┌─────────┐ ┌──────────┐ │ │
│ │ Text │◀──│ Font │◀──│ Object │◀─────────┘ │
│ │Extractor │ │ Parser │ │ Resolver│ │
│ └──────────┘ └─────────┘ └──────────┘ │
│ │ │
│ ▼ │
│ ┌──────────┐ ┌─────────┐ ┌──────────┐ │
│ │ Markdown │ │ HTML │ │ Text │ │
│ │ Exporter │ │Exporter │ │ Exporter │ │
│ └──────────┘ └─────────┘ └──────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
┌───────────────────┴─────────────────────────────────────────────┐
│ Storage Layer │
│ (Memory-mapped files, Object cache, Font cache) │
└─────────────────────────────────────────────────────────────────┘
Purpose: Tokenize PDF byte stream into lexical tokens.
Key Features:
- Zero-copy tokenization using byte slices
- Handles all PDF token types (numbers, strings, names, operators)
- Efficient whitespace and comment skipping
- Position tracking for error reporting
Performance:
- ~1-2ms for typical PDFs
- O(n) complexity where n = file size
- Minimal allocations (uses
&[u8]slices)
Example:
pub struct Lexer<'a> {
data: &'a [u8],
position: usize,
}
impl<'a> Lexer<'a> {
pub fn next_token(&mut self) -> Result<Token<'a>> {
// Zero-copy token extraction
}
}Purpose: Parse PDF objects from token stream.
Key Features:
- Recursive descent parser
- Cycle detection for circular references
- Recursion depth limiting (max 100 levels)
- Proper error context propagation
Object Types:
- Null, Boolean, Integer, Real
- String, Name
- Array, Dictionary
- Stream, Reference
Example:
pub struct Parser<'a> {
lexer: Lexer<'a>,
objects: HashMap<ObjectId, Object>,
recursion_depth: u32,
}
impl<'a> Parser<'a> {
pub fn parse_object(&mut self) -> Result<Object> {
// Recursive object parsing with cycle detection
}
}Purpose: Decompress PDF streams using various filters.
Supported Filters:
- FlateDecode: Zlib/Deflate (most common)
- LZWDecode: LZW compression
- ASCII85Decode: ASCII85 encoding
- ASCIIHexDecode: Hex encoding
- RunLengthDecode: Run-length encoding
- DCTDecode: JPEG (passthrough)
- CCITTFaxDecode: CCITT Group 3/4
Design:
pub trait StreamFilter {
fn decode(&self, data: &[u8], params: &DecodeParms) -> Result<Vec<u8>>;
}
pub struct StreamDecoder {
filters: Vec<Box<dyn StreamFilter>>,
}Performance:
- Streaming decompression where possible
- Decompression bomb protection (10× ratio limit)
- Reusable buffers to minimize allocations
Purpose: Understand document spatial structure.
Submodules:
Algorithm: Density-Based Spatial Clustering of Applications with Noise
Parameters:
- Epsilon (ε): 1.5× median character height
- MinPts: 2 for chars→words, 3 for words→lines
Data Structure: R*-tree for O(log n) neighborhood queries
Performance: O(n log n) instead of naive O(n²)
pub struct DbscanClusterer {
spatial_index: RTree<Character>,
epsilon: f32,
min_points: usize,
}
impl DbscanClusterer {
pub fn cluster(&self, chars: &[Character]) -> Vec<Cluster> {
// DBSCAN with spatial indexing
}
}Algorithm: Recursive projection-based page segmentation
Steps:
- Project content onto X-axis (horizontal)
- Find gaps in projection (valleys)
- Split at largest gap
- Recurse on sub-regions
- Repeat for Y-axis
Configuration:
- Split threshold: 0.05 × page dimension (5%)
- Max recursion: 10 levels
- Gaussian smoothing: σ=2.0
pub struct XyCutSegmenter {
threshold: f32,
max_depth: u32,
}
impl XyCutSegmenter {
pub fn segment(&self, page: &Page) -> Vec<Region> {
self.segment_recursive(page, 0)
}
fn segment_recursive(&self, region: &Region, depth: u32) -> Vec<Region> {
// Recursive XY-Cut with projection analysis
}
}Purpose: Group text by font characteristics for formatting detection.
Features:
- Size tolerance: ±1pt
- Family exact matching
- Outlier rejection: fonts used in <2% of characters
Use Cases:
- Bold/italic detection
- Heading identification
- Style consistency checking
Purpose: Extract text with proper spacing and formatting.
Key Algorithms:
- Word spacing detection: 0.25× character width threshold
- Line spacing detection: 1.2× character height threshold
- Bold detection: Font weight analysis
- Reading order: Top-to-bottom, left-to-right (with column detection)
Quality:
- 100% word spacing accuracy (tested on 103 PDFs)
- 100% whitespace preservation
- 37% more bold sections detected vs. reference
pub struct TextExtractor {
config: TextConfig,
font_cache: HashMap<String, Font>,
}
impl TextExtractor {
pub fn extract(&mut self, page: &Page) -> Result<String> {
// Character clustering → word formation → line formation → text
}
}Purpose: Parse PDF fonts and handle character encoding.
Features:
- Standard fonts: Built-in Type1 fonts (Times, Helvetica, etc.)
- ToUnicode CMap: Unicode mapping for custom encodings
- Encoding tables: PDFDocEncoding, MacRomanEncoding, WinAnsiEncoding
- Font metrics: Character widths, ascent, descent
Caching:
- Parsed fonts cached per document
- Reduces re-parsing overhead for multi-page documents
- ~25% speedup on typical documents
Purpose: Extract embedded images from PDFs.
Supported Formats:
- JPEG (DCTDecode)
- PNG (FlateDecode + predictor)
- TIFF (CCITTFaxDecode)
- Raw RGB/CMYK
Features:
- Color space conversion (CMYK → RGB, etc.)
- Predictor functions (PNG filters)
- Decompression
- Metadata extraction (dimensions, DPI)
Purpose: Convert extracted content to various formats.
Quality: 99.8/100 (tested on 103 PDFs)
Features:
- Bold text detection and formatting
- Heading detection (font size heuristics)
- List detection
- Link preservation
- Proper whitespace normalization
Quality: 94.0/100
Features:
- Semantic HTML5
- URL/email linkification
- Bold/italic tags
- Proper entity encoding
- CSS-friendly structure
Quality: 100.0/100 (whitespace)
Features:
- Whitespace normalization
- Word spacing preservation
- Line break consistency
1. User calls PdfDocument::open("file.pdf")
↓
2. Memory-map file (or load into buffer)
↓
3. Parse header, locate cross-reference table
↓
4. Build object index (lazy loading)
↓
5. Parse catalog and page tree
↓
6. Return PdfDocument handle
1. User calls doc.extract_text(page_num)
↓
2. Resolve page object from page tree
↓
3. Get content stream(s)
↓
4. Decode streams (decompress)
↓
5. Parse content stream operators
↓
6. Extract character data with positions
↓
7. Cluster characters into words (DBSCAN)
↓
8. Cluster words into lines (DBSCAN)
↓
9. Detect columns (XY-Cut)
↓
10. Sort into reading order
↓
11. Format text with proper spacing
↓
12. Return extracted text
1. User calls exporter.export_all(doc)
↓
2. For each page:
├─ Extract text blocks with metadata
├─ Detect bold sections (font analysis)
├─ Detect headings (size heuristics)
├─ Detect lists (indentation + markers)
↓
3. Format as Markdown:
├─ Headings: # ## ###
├─ Bold: **text**
├─ Lists: - item
├─ Links: [text](url)
↓
4. Concatenate pages with separators
↓
5. Return Markdown string
src/
├── lib.rs # Public API, re-exports
├── error.rs # Error types
├── object.rs # PDF object types
├── lexer.rs # Tokenization
├── parser.rs # Object parsing
├── document.rs # High-level API
├── stream_decoder.rs # Stream decompression
├── font.rs # Font parsing
├── text.rs # Text extraction
├── image.rs # Image extraction
├── form.rs # Form field extraction
├── bookmark.rs # Bookmark extraction
├── annotation.rs # Annotation extraction
│
├── layout/ # Layout analysis
│ ├── mod.rs
│ ├── dbscan.rs # DBSCAN clustering
│ ├── xy_cut.rs # XY-Cut segmentation
│ ├── font_clustering.rs
│ └── spatial_index.rs # R*-tree
│
├── converters/ # Export formats
│ ├── mod.rs
│ ├── markdown.rs # Markdown exporter
│ ├── html.rs # HTML exporter
│ └── text.rs # Plain text exporter
│
└── utils/ # Utilities
├── mod.rs
├── encoding.rs # Character encoding
└── geometry.rs # Bounding box math
python/ # Python bindings
├── src/
│ └── lib.rs # PyO3 bindings
└── pyproject.toml
tests/ # Integration tests
├── test_parsing.rs
├── test_text_extraction.rs
├── test_exporters.rs
└── fixtures/ # Test PDFs
benches/ # Performance benchmarks
└── parsing.rs
examples/ # Usage examples
├── basic.rs
├── batch_processing.rs
└── custom_exporter.rs
Pseudo-code:
function dbscan(points, epsilon, minPts):
clusters = []
visited = set()
for point in points:
if point in visited:
continue
visited.add(point)
neighbors = find_neighbors(point, epsilon) # O(log n) with R*-tree
if len(neighbors) < minPts:
# Noise point
continue
# Start new cluster
cluster = [point]
queue = neighbors
while queue:
neighbor = queue.pop()
if neighbor not in visited:
visited.add(neighbor)
new_neighbors = find_neighbors(neighbor, epsilon)
if len(new_neighbors) >= minPts:
queue.extend(new_neighbors)
if neighbor not in any cluster:
cluster.append(neighbor)
clusters.append(cluster)
return clusters
Optimization: R*-tree spatial index allows O(log n) neighbor queries instead of O(n) linear scan.
Pseudo-code:
function xy_cut(region, depth):
if depth > MAX_DEPTH:
return [region]
# Project onto X-axis
x_projection = project_horizontal(region)
x_gaps = find_gaps(x_projection, threshold)
if x_gaps:
split_x = largest_gap(x_gaps)
left, right = split_region(region, split_x, vertical=True)
return xy_cut(left, depth+1) + xy_cut(right, depth+1)
# Project onto Y-axis
y_projection = project_vertical(region)
y_gaps = find_gaps(y_projection, threshold)
if y_gaps:
split_y = largest_gap(y_gaps)
top, bottom = split_region(region, split_y, vertical=False)
return xy_cut(top, depth+1) + xy_cut(bottom, depth+1)
# No split found
return [region]
Algorithm:
For each pair of adjacent characters (c1, c2):
gap = c2.x - (c1.x + c1.width)
threshold = 0.25 * c1.width
if gap > threshold:
insert_space()
Dynamic threshold: Uses character width for adaptive spacing detection.
Technique: Use byte slices (&[u8]) instead of copying data.
Example:
// Zero-copy token
pub enum Token<'a> {
Name(&'a [u8]),
String(&'a [u8]),
// ...
}
// No allocation needed
let token = lexer.next_token()?;Technique: Use mmap for large files.
Benefits:
- OS handles memory management
- Lazy loading (only accessed pages loaded)
- Shared memory across processes
Strategy: Cache parsed objects by object ID.
Benefits:
- Avoid re-parsing referenced objects
- O(1) lookup for cross-references
- Memory bounded by unique object count
Technique: Process pages one at a time.
Benefits:
- Constant memory regardless of document size
- Can process 10 GB documents with <100 MB RAM
Implement the Exporter trait:
pub trait Exporter {
fn export_page(&self, doc: &mut PdfDocument, page_num: u32) -> Result<String>;
fn export_all(&self, doc: &mut PdfDocument) -> Result<String>;
}Implement the StreamFilter trait:
pub trait StreamFilter {
fn decode(&self, data: &[u8], params: &DecodeParms) -> Result<Vec<u8>>;
}Replace or extend layout analysis:
pub trait LayoutAnalyzer {
fn analyze(&self, page: &Page) -> Result<Layout>;
}PDF as untrusted input:
- Limit file size (default: 500 MB)
- Limit object count (default: 1M objects)
- Limit recursion depth (default: 100 levels)
- Validate array/dictionary sizes
Protection:
- Limit decompressed size to 10× compressed size
- Check compression ratio before full decompression
- Stream decompression where possible
Rust guarantees:
- No buffer overflows
- No use-after-free
- No data races
- Bounds checking on all array accesses
Minimal unsafe:
- Used sparingly (e.g., performance-critical paths)
- Thoroughly documented with safety invariants
- Reviewed by multiple developers
v1.x:
- Stream API for ultra-low memory usage
- Table detection (smart heuristics)
- Rotated text handling
v2.x:
- ML-based layout analysis (optional)
- GPU acceleration (optional)
- Parallel page processing
v3.x:
- OCR integration (Tesseract)
- PDF/A validation
- Advanced form handling
- PDF Specification: ISO 32000-1:2008 (PDF 1.7) -
docs/spec/pdf.md - Architecture: ARCHITECTURE.md - Detailed system design
- DBSCAN: Ester et al., "A Density-Based Algorithm for Discovering Clusters" (1996)
- XY-Cut: Nagy & Seth, "Hierarchical Representation of Optically Scanned Documents" (1984)
- R-tree*: Beckmann et al., "The R*-tree: An Efficient and Robust Access Method" (1990)
- GitHub Issues: https://github.com/yfedoseev/pdf_oxide/issues
- Discussions: https://github.com/yfedoseev/pdf_oxide/discussions
- Documentation: https://docs.rs/pdf_oxide