Merged
Changes from 17 commits
Commits
19 commits
80fc452
feat(extraction): #734 two-column structured + tagged structure surfa…
yfedoseev Jun 13, 2026
c3934f1
feat(extraction): v0.3.65 multilingual+layout quality batch — RTL bid…
yfedoseev Jun 14, 2026
bb20771
feat(extraction): RW-1e figure-XObject /BBox clip + SEG-HE/SEG-AR RTL…
yfedoseev Jun 15, 2026
12443e8
perf(pdf-to-ir): skip O(spans*chars) rotated-char filter on unrotated…
yfedoseev Jun 15, 2026
3352b2a
perf: remove O(n^2)/O(n*m) hotspots in table filter, xycut, hyphen me…
yfedoseev Jun 15, 2026
9f84d69
fix(extraction): RW-1b skip numbered-list markers in rowspan-label re…
yfedoseev Jun 15, 2026
9b7c6d2
perf(extraction): drop-cap initial pairing O(n^2) -> windowed binary …
yfedoseev Jun 15, 2026
5bc6b83
fix(extraction): SEG-AR preserve true glyph x when merging scrambled-…
yfedoseev Jun 15, 2026
83a7b81
fix(reading-order): RW-1 D1 peel bottom-spanning trailing blocks afte…
yfedoseev Jun 15, 2026
1bf70a0
fix(reading-order): RW-1 D3 segregate publisher-metadata sidebar via …
yfedoseev Jun 15, 2026
fc39353
fix(reading-order): RW-1 D3 extend publisher-sidebar segregation to m…
yfedoseev Jun 15, 2026
27acc44
fix(extraction): SEG-AR-B suppress geometric word-shatter on /Reverse…
yfedoseev Jun 15, 2026
dcc6865
docs: note md/html RTL whole-span-sort limitation at apply_rtl_logica…
yfedoseev Jun 15, 2026
72a9e0f
fix(decode): #738 in-house CCITT Group 4 decoder honoring EncodedByte…
yfedoseev Jun 16, 2026
b9a2dbd
release: bump version to 0.3.65
yfedoseev Jun 16, 2026
b117205
fix(extraction): gate RW-1e form /BBox clip to figure-sized forms only
yfedoseev Jun 16, 2026
bb44e59
release: v0.3.65 changelog + version bump across all bindings
yfedoseev Jun 16, 2026
254182e
ci: fix rustfmt, clippy -D warnings, and rw1 reading-order lock test
yfedoseev Jun 16, 2026
f960cde
docs: fix broken rustdoc intra-doc links (private-item targets)
yfedoseev Jun 16, 2026
24 changes: 24 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,30 @@

All notable changes to PDFOxide are documented here.

## [0.3.65] - 2026-06-16

> Multilingual and layout extraction quality — right-to-left bidi reconstruction for Arabic and Hebrew, multi-region reading order for publisher sidebars and two-column academic pages, and CJK/Indic word segmentation — plus an in-house CCITT Group 4 fax decoder that honours `EncodedByteAlign`, structured two-column surfacing, and a batch of O(n²) hot-path removals.

### Added

- **Two-column structured extraction and tagged-structure surfacing (#734)** — `extract_structured` now reports a per-line `column_index` for multi-column pages and, on tagged PDFs, surfaces marginal labels (`Lbl` → marginal label) and the nearest enclosing section (`Sect`/`Art`/`Part` → a document-stable `section_id` with cross-page continuity), per ISO 32000-1 §14.8.4. Additive and zero-risk for untagged input. Thanks @lggcs.
- **Reading-order threads for linked content (#458)** — article-thread (`/Threads` → `/B` bead) ordering is surfaced so content that flows across columns and pages can be read in author-intended order.
- **In-house CCITT Group 4 (T.6) fax decoder (#738)** — a from-scratch decoder for `CCITTFaxDecode` images that correctly honours `EncodedByteAlign` (ITU-T T.6 2D mode codes, Modified-Huffman run tables, reference-line changing-element walk), with partial-row recovery on truncated streams. Replaces a path that could silently fall back to an all-white image; bilevel fax images now decode to their real content. Thanks @potatochipcoconut.
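
The `EncodedByteAlign` behaviour named in the CCITT entry above can be illustrated with a minimal sketch. When the flag is true, each encoded row is padded with fill bits so that it starts on a byte boundary, and a decoder has to skip that padding before reading the next row. The `BitReader` type and the `begin_row` function are hypothetical names used for illustration only; this is not pdf_oxide's decoder, and the T.6 mode codes and run-length tables are left out entirely.

```rust
/// Minimal MSB-first bit reader (hypothetical stand-in, not pdf_oxide's type).
pub struct BitReader<'a> {
    data: &'a [u8],
    /// Absolute bit offset from the start of `data`.
    bit_pos: usize,
}

impl<'a> BitReader<'a> {
    pub fn new(data: &'a [u8]) -> Self {
        Self { data, bit_pos: 0 }
    }

    /// Read one bit, most significant bit first; `None` at end of input.
    pub fn read_bit(&mut self) -> Option<u8> {
        let byte = *self.data.get(self.bit_pos / 8)?;
        let bit = (byte >> (7 - self.bit_pos % 8)) & 1;
        self.bit_pos += 1;
        Some(bit)
    }

    /// Skip fill bits so the next read starts on a byte boundary.
    /// A reader that is already aligned does not move.
    pub fn align_to_byte(&mut self) {
        self.bit_pos = (self.bit_pos + 7) / 8 * 8;
    }

    pub fn bit_pos(&self) -> usize {
        self.bit_pos
    }
}

/// Call before decoding each row. With `encoded_byte_align` set, the
/// row is assumed to begin at the next byte boundary.
pub fn begin_row(reader: &mut BitReader<'_>, encoded_byte_align: bool) {
    if encoded_byte_align {
        reader.align_to_byte();
    }
}
```

A decoder that ignores the flag on such a stream would start each row a few bits early and misread every code after the first row, which is one plausible route to the degenerate output the entry describes.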

### Fixed

- **Right-to-left Arabic/Hebrew text reconstructed in logical order** — several classes of RTL extraction defect are corrected so Arabic and Hebrew read correctly instead of scrambled:
- **Cross-span cluster reversal (SEG-AR)** — producers that draw an Arabic word as interleaved base-glyph and zero-width mark spans (the mark's x falling *inside* a neighbouring word) had their letters atom-sorted to word edges, scrambling e.g. `الثدييات` → `ثالدييات`. Pure-RTL lines with such zero-width-inside-a-span runs are now collapsed into a single visual-order span — glyphs ordered by x, combining marks bound to their base, word boundaries taken from the producer's own standalone space spans — then reversed to logical order (UAX #9 L2). A representative Arabic page improved from a heavily garbled paragraph to fully correct text.
- **RTL number preservation (SEG-AR / SEG-HE)** — Arabic-Indic and Latin digit runs embedded in RTL text are no longer reversed: `٤٣٤١` now reads `١٤٣٤` (1434) and a Hebrew `ל ,2009-` now reads `ל-2009,`, matching a conformant bidi reorder.
- **Glyph-advance preservation when merging scrambled-RTL spans** — merging adjacent RTL spans no longer corrupts true glyph positions, and a real word break bordering non-cursive punctuation is kept (rather than suppressed as a cursive-shatter space) on `/ReversedChars` producers.
- **Multi-region reading order for publisher sidebars and two-column pages** — narrow publisher-metadata sidebars are now segregated from the body and emitted after it (title and body merged top-to-bottom, sidebar last) instead of being interleaved, across text, Markdown, and HTML. Bottom-spanning blocks that follow a multi-column region are peeled correctly, numbered-list markers are skipped in rowspan-label reordering, and two-column prose is linearised column-major. A figure Form XObject's `/BBox` clip (ISO 32000-1 §8.10.1) now drops a draft-galley underlay a conformant renderer would clip — gated to figure-sized forms so a full-page content-frame wrapper keeps its body.
- **CJK and Indic word segmentation** — Korean number/counter spacing and line-break rejoining (`1 만년` → `1만년`), and stray spaces before Bengali/Devanagari/Latin punctuation (`प्राणी ।` → `प्राणी।`), are corrected. Adobe predefined CIDFont collections decode through the documented CID → Unicode path (ISO 32000-1 §9.3.3).
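
The RTL number-preservation fix above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the library's implementation: it takes a pure-RTL line whose characters are in visual (left-to-right on the page) order, reverses the whole line to logical order, and then restores each digit run so numbers still read left to right. The function name `visual_to_logical` is hypothetical, only ASCII and Arabic-Indic digits are recognised, and combining marks, separators inside numbers (such as `-` or `,`), and mixed-direction lines are not handled.

```rust
/// True for ASCII digits and Arabic-Indic digits (U+0660..=U+0669).
fn is_digit(c: char) -> bool {
    c.is_ascii_digit() || ('\u{0660}'..='\u{0669}').contains(&c)
}

/// Convert a pure-RTL line from visual order to logical order while
/// keeping digit runs in their original left-to-right order.
/// Hypothetical helper for illustration; a conformant reorder follows UAX #9.
pub fn visual_to_logical(visual: &str) -> String {
    // Step 1: reverse every character, which also reverses the digits.
    let mut chars: Vec<char> = visual.chars().rev().collect();

    // Step 2: re-reverse each maximal run of digits to undo step 1 for numbers.
    let mut i = 0;
    while i < chars.len() {
        if is_digit(chars[i]) {
            let start = i;
            while i < chars.len() && is_digit(chars[i]) {
                i += 1;
            }
            chars[start..i].reverse();
        } else {
            i += 1;
        }
    }
    chars.into_iter().collect()
}
```

Skipping the second step is what produces the reversed-year symptom described in the entry, where the digits of 1434 come out in the opposite order.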

### Changed

- **Performance — O(n²) and O(n·m) hot-path removals** — drop-cap initial pairing uses a windowed binary search; the rotated-character filter is skipped entirely on unrotated pages; and table filtering, XY-cut, hyphen merging, and word extraction lose their quadratic hot paths. Text/Markdown/HTML output is unchanged by these changes.
- **Redundant clip-mask clone dropped in `apply_pending_clip` (#654)** — the render path no longer clones the clip mask when it is about to be replaced, trimming an allocation per clipped paint. Pixel output is sub-perceptually unchanged. Thanks @RayVR.
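
The "windowed binary search" mentioned in the performance entry above can be sketched as follows. The idea, as assumed here, is that candidates are kept sorted by one coordinate, so the neighbours within a tolerance of a given position form a contiguous slice that two binary searches can locate, instead of comparing every pair. The function name `window` and the use of a bare `f32` coordinate are assumptions for illustration; the actual pairing logic in the library is not shown in this diff.

```rust
/// Return the contiguous sub-slice of `sorted_ys` whose values lie in
/// `[center - tol, center + tol]`. `sorted_ys` must be sorted ascending.
/// Hypothetical helper: each lookup costs O(log n) instead of O(n).
pub fn window(sorted_ys: &[f32], center: f32, tol: f32) -> &[f32] {
    // First index whose value is not below the lower bound.
    let lo = sorted_ys.partition_point(|&y| y < center - tol);
    // First index whose value is above the upper bound.
    let hi = sorted_ys.partition_point(|&y| y <= center + tol);
    // `max` guards against an inverted range when `tol` is negative.
    &sorted_ys[lo..hi.max(lo)]
}
```

Pairing each of n items against only its window brings the total cost to roughly O(n log n) plus the size of the windows, compared with O(n²) for an all-pairs scan.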

## [0.3.64] - 2026-06-12

> Composite-CJK page rendering — a bundled Droid Sans fallback now paints embed-less Type 0 fonts and the Adobe predefined CIDFont collections — plus a §11 transparency compositing surface with optional lcms2 colour management, cross-document font-cache correctness, valid annotation appearance streams, and math/CJK text-extraction polish (prime-notation spacing, signed unit exponents, CJK bracket spacing, table-header Markdown).
8 changes: 4 additions & 4 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion Cargo.toml
@@ -60,7 +60,7 @@ manual_checked_ops = "allow"

[package]
name = "pdf_oxide"
-version = "0.3.64"
+version = "0.3.65"
# MSRV — driven up from 1.82 for v0.3.38. Transitive deps pulled in
# this release push the floor to 1.88:
# - hybrid-array 0.4.10 (via RustCrypto) → edition 2024 → 1.85
2 changes: 1 addition & 1 deletion README.md
@@ -298,7 +298,7 @@ cargo install pdf_oxide_mcp # Cargo
<dependency>
<groupId>fyi.oxide</groupId>
<artifactId>pdf-oxide</artifactId>
-<version>0.3.60</version>
+<version>0.3.65</version>
</dependency>
```

2 changes: 1 addition & 1 deletion csharp/PdfOxide/PdfOxide.csproj
@@ -19,7 +19,7 @@
<!-- NuGet Package Configuration -->
<GeneratePackageOnBuild>false</GeneratePackageOnBuild>
<PackageId>PdfOxide</PackageId>
-<Version>0.3.64</Version>
+<Version>0.3.65</Version>
<Title>PdfOxide</Title>
<Authors>pdf_oxide Contributors</Authors>
<Company>pdf_oxide Project</Company>
2 changes: 1 addition & 1 deletion go/cmd/install/main.go
@@ -52,7 +52,7 @@ const (
// taken from the build info and THIS constant is irrelevant. That's what
// lets `@latest` just work — each tagged release resolves to its own
// version automatically, without a sed step in release automation.
-fallbackVersion = "0.3.64"
+fallbackVersion = "0.3.65"
BaseURL = "https://github.com/yfedoseev/pdf_oxide/releases/download"
// cacheSubdir lives under os.UserCacheDir() — XDG_CACHE_HOME on Linux,
// ~/Library/Caches on macOS (Time-Machine-excluded), %LocalAppData% on
6 changes: 3 additions & 3 deletions java/pom.xml
@@ -8,7 +8,7 @@
namespace verification under oxide.fyi)
artifactId: pdf-oxide (the Maven artifact; matches the
package fyi.oxide.pdf)
-version: 0.3.64 (lockstep with Cargo workspace /
+version: 0.3.65 (lockstep with Cargo workspace /
js/package.json / .csproj /
pyproject.toml — release-preflight
from v0.3.51 #515 enforces parity)
@@ -33,7 +33,7 @@

<groupId>fyi.oxide</groupId>
<artifactId>pdf-oxide</artifactId>
-<version>0.3.64</version>
+<version>0.3.65</version>
<packaging>jar</packaging>

<name>pdf_oxide — Java binding</name>
@@ -72,7 +72,7 @@
<connection>scm:git:https://github.com/yfedoseev/pdf_oxide.git</connection>
<developerConnection>scm:git:git@github.com:yfedoseev/pdf_oxide.git</developerConnection>
<url>https://github.com/yfedoseev/pdf_oxide</url>
-<tag>v0.3.64</tag>
+<tag>v0.3.65</tag>
</scm>

<issueManagement>
2 changes: 1 addition & 1 deletion js/package.json
@@ -1,6 +1,6 @@
{
"name": "pdf-oxide",
-"version": "0.3.64",
+"version": "0.3.65",
"type": "module",
"description": "High-performance PDF parsing and text extraction library — prebuilt native bindings, no build toolchain required",
"main": "lib/index.js",
4 changes: 2 additions & 2 deletions pdf_oxide_cli/Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "pdf_oxide_cli"
-version = "0.3.64"
+version = "0.3.65"
edition = "2021"
description = "CLI for pdf-oxide — the fastest PDF toolkit. 22 commands: text extraction, PDF to markdown, search, merge, split, images, compress, encrypt, watermark, forms, and more."
license = "MIT OR Apache-2.0"
@@ -34,7 +34,7 @@ workspace = true
ocr = ["pdf_oxide/ocr"]

[dependencies]
-pdf_oxide = { version = "0.3.64", path = "..", features = ["rendering", "logging"] }
+pdf_oxide = { version = "0.3.65", path = "..", features = ["rendering", "logging"] }
clap = { version = "4", features = ["derive"] }
is-terminal = "0.4"
serde_json = "1.0"
7 changes: 7 additions & 0 deletions pdf_oxide_cli/src/cli/args.rs
@@ -56,6 +56,13 @@ pub enum Command {
#[arg(long, value_parser = ["plain", "words", "lines", "structured"], default_value = "plain")]
format: String,

/// Column detection for `--format structured` (issue #734):
/// `auto` (heuristic), `two` (force a two-column split for
/// reference-edition layouts the heuristic is conservative about),
/// or `single` (suppress columns). Untagged/geometric pages only.
#[arg(long, value_parser = ["auto", "two", "single"], default_value = "auto")]
column_mode: String,

/// Specific area to extract from as x,y,width,height (points)
#[arg(long)]
area: Option<String>,
9 changes: 8 additions & 1 deletion pdf_oxide_cli/src/cli/commands/text.rs
@@ -5,6 +5,7 @@ use std::path::Path;
pub fn run(
file: &Path,
format: &str,
column_mode: &str,
area: Option<&str>,
pages: Option<&str>,
output: Option<&Path>,
@@ -25,9 +26,15 @@ pub fn run(
// assignment), so it is emitted as JSON regardless of the `--json` flag and
// ignores `--area` (it operates on the whole page).
if format == "structured" {
// clap restricts `--column-mode` to these three values.
let mode = match column_mode {
"two" => pdf_oxide::ColumnMode::Two,
"single" => pdf_oxide::ColumnMode::Single,
_ => pdf_oxide::ColumnMode::Auto,
};
let mut all_pages = Vec::new();
for &page_idx in &page_indices {
-let structured = doc.extract_structured(page_idx)?;
+let structured = doc.extract_structured_with_column_mode(page_idx, mode)?;
all_pages.push(serde_json::json!({
"page": page_idx + 1,
"structured": serde_json::to_value(&structured).unwrap(),
14 changes: 12 additions & 2 deletions pdf_oxide_cli/src/cli/mod.rs
@@ -60,8 +60,18 @@ fn dispatch(
Command::Text {
ref file,
ref format,
ref column_mode,
ref area,
-} => commands::text::run(file, format, area.as_deref(), pages, output, password, json),
+} => commands::text::run(
+file,
+format,
+column_mode,
+area.as_deref(),
+pages,
+output,
+password,
+json,
+),
Command::Paths {
ref file,
ref format,
@@ -227,7 +237,7 @@ fn run_piped_stdin() -> pdf_oxide::Result<()> {
));
}
let file = std::path::PathBuf::from(&path);
-commands::text::run(&file, "plain", None, None, None, None, false)
+commands::text::run(&file, "plain", "auto", None, None, None, None, false)
} else {
Err(pdf_oxide::Error::InvalidOperation("No input received on stdin".to_string()))
}
2 changes: 2 additions & 0 deletions pdf_oxide_cli/src/cli/repl.rs
@@ -194,6 +194,7 @@ fn cmd_text(state: &mut ReplState, args: &str) -> pdf_oxide::Result<()> {
super::commands::text::run(
Path::new(args),
"plain",
"auto",
None,
None,
None,
@@ -213,6 +214,7 @@ fn cmd_text(state: &mut ReplState, args: &str) -> pdf_oxide::Result<()> {
super::commands::text::run(
&path,
"plain",
"auto",
None,
None,
None,
51 changes: 51 additions & 0 deletions pdf_oxide_cli/tests/structured_format.rs
@@ -55,3 +55,54 @@ fn text_structured_is_listed_as_a_valid_format() {
String::from_utf8_lossy(&out.stderr)
);
}

/// `--column-mode single` must suppress all column indices; `--column-mode two`
/// must force a split (≥1 region with `column_index": 1`) on a layout `auto` is
/// conservative about (issue #734 Fix 3).
#[test]
fn text_structured_column_mode_overrides() {
let run = |mode: &str| -> String {
let out = Command::new(bin())
.args(["text", "--format", "structured", "--column-mode", mode])
.arg(fixture("multi_column_table.pdf"))
.output()
.expect("run pdf-oxide");
assert!(
out.status.success(),
"--column-mode {mode} failed; stderr: {}",
String::from_utf8_lossy(&out.stderr)
);
String::from_utf8_lossy(&out.stdout).into_owned()
};

// single: every column_index null.
let single = run("single");
assert!(
!single.contains("\"column_index\": 0") && !single.contains("\"column_index\": 1"),
"column-mode single must null all column indices: {single}"
);

// two: at least one region forced into the right column.
let two = run("two");
assert!(
two.contains("\"column_index\": 1"),
"column-mode two must force a two-column split: {two}"
);
}

/// Guard the clap value_parser: an unknown `--column-mode` is rejected.
#[test]
fn text_rejects_unknown_column_mode() {
let out = Command::new(bin())
.args([
"text",
"--format",
"structured",
"--column-mode",
"diagonal",
])
.arg(fixture("multi_column_table.pdf"))
.output()
.expect("run pdf-oxide");
assert!(!out.status.success(), "unknown --column-mode must be rejected");
}
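
The tests above rely on clap's `value_parser` list to reject unknown `--column-mode` values before the command runs. As a hedged sketch of how the same three-value contract could be enforced at the type level, the string can be parsed into an enum through `FromStr`. The `ColumnMode` enum below is a local stand-in that mirrors the variant names used in the diff (`Auto`, `Two`, `Single`); it is not the `pdf_oxide::ColumnMode` type itself, and the diff does not indicate that the library provides a `FromStr` implementation.

```rust
use std::str::FromStr;

/// Local stand-in mirroring the variant names seen in the diff;
/// hypothetical, not the library's own type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ColumnMode {
    Auto,
    Two,
    Single,
}

impl FromStr for ColumnMode {
    type Err = String;

    /// Accept exactly the three spellings the CLI advertises and
    /// report anything else as an error instead of defaulting.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "auto" => Ok(Self::Auto),
            "two" => Ok(Self::Two),
            "single" => Ok(Self::Single),
            other => Err(format!("unknown column mode: {other}")),
        }
    }
}
```

The `match` in the diff falls back to `Auto` for any unrecognised string, which is safe there because clap has already filtered the input. A fallible parse like this one trades that brevity for an explicit error if the function is ever reached from a path that bypasses clap.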
4 changes: 2 additions & 2 deletions pdf_oxide_jni/Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "pdf_oxide_jni"
-version = "0.3.64"
+version = "0.3.65"
edition = "2021"
description = "JNI bindings for pdf_oxide — native Java binding, the 8th surface alongside Python/Go/JS/C#/WASM/CLI/MCP. Loaded by the fyi.oxide:pdf-oxide Maven artifact."
license = "MIT OR Apache-2.0"
@@ -93,7 +93,7 @@ jni = "0.22"
# opt-in FIPS 140-3 build) — those two are compile-time mutually
# exclusive (pdf_oxide enforces via compile_error!). We always
# enable `icc` for ICC-based colour management.
-pdf_oxide = { version = "0.3.64", path = "..", default-features = false, features = ["icc"] }
+pdf_oxide = { version = "0.3.65", path = "..", default-features = false, features = ["icc"] }

# JSON envelope for the v0.3.51 AutoExtractor rich-result path. The
# Java side gets the PageExtraction / DocumentExtraction as a JSON
4 changes: 2 additions & 2 deletions pdf_oxide_mcp/Cargo.toml
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
[package]
name = "pdf_oxide_mcp"
-version = "0.3.64"
+version = "0.3.65"
edition = "2021"
description = "MCP server for PDF extraction — gives Claude, Cursor, and AI assistants the ability to read PDFs locally. Text, markdown, and HTML output. Powered by pdf_oxide."
license = "MIT OR Apache-2.0"
@@ -19,7 +19,7 @@ path = "src/main.rs"
workspace = true

[dependencies]
-pdf_oxide = { version = "0.3.64", path = ".." }
+pdf_oxide = { version = "0.3.65", path = ".." }
serde_json = "1.0"

[dev-dependencies]