release: v0.3.65 — multilingual & layout extraction quality by yfedoseev · Pull Request #739 · yfedoseev/pdf_oxide

yfedoseev · 2026-06-16T03:49:07Z

v0.3.65 — Multilingual & layout extraction quality

Right-to-left bidi reconstruction (Arabic/Hebrew), multi-region reading order (publisher sidebars + two-column academic), CJK/Indic word segmentation, an in-house CCITT Group 4 fax decoder, structured two-column surfacing, and a batch of O(n²) hot-path removals.

Added

Two-column structured extraction + tagged-structure surfacing — per-line column_index, Lbl→marginal label, Sect/Art/Part→section_id (ISO 32000-1 §14.8.4).
Reading-order threads — /Threads→/B bead ordering for content that flows across columns/pages.
In-house CCITT Group 4 (T.6) fax decoder — honours EncodedByteAlign, partial-row recovery; replaces a silent all-white fallback.

Fixed

RTL Arabic/Hebrew logical-order reconstruction — cross-span cluster reversal (الثدييات), RTL number preservation (٤٣٤١→١٤٣٤, ל ,2009-→ל-2009,), glyph-advance preservation on /ReversedChars producers.
Multi-region reading order — publisher-sidebar segregation across text/md/html, bottom-spanning block peel, rowspan-label fix, two-column prose linearisation, figure /BBox clip (gated to figure-sized forms so full-page wrappers keep their body).
CJK/Indic word segmentation — Korean number/counter spacing + line-break rejoin, stray spaces before Bengali/Devanagari/Latin punctuation, Adobe predefined CIDFont CID→Unicode (§9.3.3).

Changed

Performance — O(n²)/O(n·m) hot-path removals (drop-cap pairing, rotated-char filter, table filter, XY-cut, hyphen merge, word extraction); output unchanged.
Redundant clip-mask clone dropped in apply_pending_clip — sub-perceptual pixel change.

Verification

v0.3.64→v0.3.65 release regression sweep: 301 PDFs, 0 regressions (all diffs are improvements or benign laterals); google_doc_document.pdf table GUARD byte-identical.
Version bumped 0.3.64→0.3.65 across all binding manifests + parity tests.

Contributors

Thanks @RayVR (#654), @potatochipcoconut (#738), @lggcs (#734).

Closes #738
Closes #734
Closes #458
Ref: #654

…cing; #458 threads

Corpus-validated (v0.3.64 vs HEAD, 156-PDF sweep). Ships the parts that are spec-correct and byte-identical-safe. The automatic text/md/html column-major reorder is REVERTED: the sweep proved the per-flow heuristics regress real two-column PDFs (arxiv references scrambled, a road data-sheet form glued, a TOC's page numbers detached), the same wall the in-partition band-peel hit 3×. text/md/html output is byte-identical to v0.3.64 (verified on the three regressors). The correct fix is a single recursive XY-cut column-region pass (`page_reading_order`) consumed by ALL flows; it is tracked as a follow-up and gated on the corpus (KJF text/md/html targets are #[ignore]d until it lands).

Shipped (corpus-safe / additive):

* structured `extract_structured`: populate `column_index` for two-column bodies (gutter-bridge-aware detector) and group one region per column in column-major order (#734 §1/§2); best-effort, structured-only.
* tagged structure surfacing (§14.8.4): `Lbl` → MarginalLabel (#734 §4); nearest `Sect`/`Art`/`Part` → document-stable `section_id` on each region, giving cross-page chapter continuity (#734 §5) and spillover grouping (#734 §6). Additive: empty for untagged/suspect PDFs, so their output is unchanged. traversal.rs tracks the nearest Sect/Art/Part ancestor per MCID.
* ColumnMode {Auto,Two,Single} via MCP `column_mode` + CLI `--column-mode`.
* #458 article threads: page_article_bead_rects gate (≥2 beads ∧ ≥80% coverage ∧ ≥2 x-bands ∧ order-divergence) wired into page_reading_order, gated on !has_structure_tree; order-divergence keeps single-column threads identical.

Reverted (corpus-regressing): per-flow column-major in assemble_text_from_spans, to_markdown/to_html (split-convert), and the gutter/cmp/split helpers.

Local guard battery + corpus diff: text/md/html byte-identical to v0.3.64 on the 3 regressors; structured/tagged/threads tests green.

…i, reading order, word-seg

Corpus-verified 0-regression (156-PDF byte-sweep + per-doc CER vs pdf_benches golds):

- RTL Arabic/Hebrew: number-run preservation on visual->logical reversal (UAX#9 L2, reverse_rtl_keep_numbers); right-joining joining-type discriminator for interior-space strip; density-gated de-shatter of glyph-exploded words; and the cross-span GLYPH INTERLEAVE repair (merge_interleaved_rtl_lines): producer-drawn zero-width mark/consonant spans whose x falls inside a body span are re-collapsed per visual line using the producer's standalone-space word boundaries, fixing al-thadyiyaat-class scrambles. wiki-cat-ar CER 0.134 -> 0.079; gated so already-correct RTL (BidiSample, ArabicCIDTrueType, hebrew_mirrored: zero width-0 spans) is byte-identical.
- RW-1 multi-region reading order: sidebar_body_reading_order band-first emit for narrow-sidebar+wide-body pages (PMC title no longer shattered along the body gutter).
- Two-column reading order (attempt #5), Korean over-seg, Bengali/Hindi clause-punct hug, CID-Tw single-byte gating (§9.3.3).

Hand-built in-code fixtures (no third-party PDFs); RTL/reading-order guard battery green.
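The number-run preservation described above can be illustrated with a minimal sketch. It assumes the input is a single pure-RTL run already in visual order, and it treats only ASCII and Arabic-Indic digits as numbers. The function name `reverse_keep_digit_runs` is hypothetical and is not the crate's `reverse_rtl_keep_numbers`.

```rust
/// Hypothetical digit test: ASCII 0-9 plus Arabic-Indic U+0660..U+0669.
fn is_digit(c: char) -> bool {
    c.is_ascii_digit() || ('\u{0660}'..='\u{0669}').contains(&c)
}

/// Reverse a visual-order RTL run into logical order, but keep every run of
/// digits in its original left-to-right order (numbers read LTR inside RTL).
fn reverse_keep_digit_runs(visual: &str) -> String {
    let chars: Vec<char> = visual.chars().collect();
    let mut out = String::with_capacity(visual.len());
    let mut i = chars.len();
    while i > 0 {
        if is_digit(chars[i - 1]) {
            // Walk back to the start of this digit run, then emit it forwards.
            let end = i;
            while i > 0 && is_digit(chars[i - 1]) {
                i -= 1;
            }
            out.extend(&chars[i..end]);
        } else {
            // Non-digit characters are emitted in reversed order.
            i -= 1;
            out.push(chars[i]);
        }
    }
    out
}
```

On the neutral-plus-digits fragment quoted in the SEG-HE commit, `reverse_keep_digit_runs(" ,2009-")` yields `"-2009, "`: the punctuation swaps sides while the year keeps its digit order.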

… word-boundary fixes

RW-1e (real-academic): honor the ISO 32000-1 §8.10.1 Form XObject /BBox clip in process_xobject: drop spans the form paints OUTSIDE its /BBox. pdfTeX figure forms embed a 'FOR PEER REVIEW' draft-galley page outside the figure BBox that a conformant renderer clips; pdf_oxide was emitting it as duplicate body text. PMC8103263 CER 0.510->0.155 (pymupdf parity), dup_ratio 0.254->0.025. Allocation-free fast path keeps form-heavy docs fast. Matches pymupdf's verified behaviour.

SEG-HE: number-preserving RTL visual->logical reversal of a neutral+single-digit span in a pure-RTL run (' ,2009-' -> '-2009, '); wiki-cat-he wJacc 0.873->1.000.

SEG-AR: a space bordering non-cursive clause punctuation is a real word break, not a cursive shatter space; wiki-cat-ar wJacc 0.574->0.600.

156-PDF byte-sweep: all RTL fixtures (BidiSample/ArabicCIDTrueType/hebrew_mirrored/PDFBOX-4531/issue10301/issue18117) byte-identical; only deterministic corpus change is one benign same-length arxiv reorder. Hand-built in-code fixtures.

… pages span_overlaps_rotated_chars drops a span only when its nearest char is rotated (>=5deg); on a page with NO rotated char (the overwhelming majority) the per-span nearest-char scan ran O(spans*chars) only to never drop anything. Gate the whole retain behind one O(chars) precheck — byte-identical output, removes the quadratic on every unrotated page of the PDF->IR (docx/pptx/xlsx/round-trip) path.
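The precheck gate can be sketched as follows. The `Char` and `Span` types, the point-distance metric, and the function names are assumptions made for illustration; only the 5-degree threshold and the "one O(chars) scan before the retain" shape come from the commit message.

```rust
// Hypothetical minimal types; the real spans and chars carry far more state.
struct Char { x: f32, y: f32, rotation_deg: f32 }
struct Span { x: f32, y: f32, text: String }

fn is_rotated(c: &Char) -> bool {
    c.rotation_deg.abs() >= 5.0
}

fn dist2(s: &Span, c: &Char) -> f32 {
    (s.x - c.x).powi(2) + (s.y - c.y).powi(2)
}

/// O(chars) per span: find the nearest char and report whether it is rotated.
fn nearest_char_is_rotated(span: &Span, chars: &[Char]) -> bool {
    chars
        .iter()
        .min_by(|a, b| dist2(span, a).total_cmp(&dist2(span, b)))
        .map_or(false, is_rotated)
}

fn drop_spans_over_rotated_chars(spans: &mut Vec<Span>, chars: &[Char]) {
    // Gate: a single O(chars) scan. With no rotated char on the page the
    // retain below could never drop anything, so skip it entirely.
    if !chars.iter().any(is_rotated) {
        return;
    }
    spans.retain(|s| !nearest_char_is_rotated(s, chars));
}
```

The early return changes no output: it only skips a scan whose predicate would have been false for every span.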

…rge, word extraction

All byte-identical (the 156-PDF byte-sweep shows ZERO output change vs the prior binary; only the pre-existing RW-1e arxiv reorder differs).

- PERF-1: untagged table-ownership filter (document.rs). The per-span containment test was O(spans*cells) (~2e7 AABB tests on dense table pages). Index cell bboxes into coarse y-bands once; a span only probes cells in its y-band. A containing cell always shares the span's y-band, so the result is identical to the full scan.
- PERF-6: is_single_column_region (xycut.rs). Two O(k^2) within-tolerance cluster scans over per-region lines/gaps replaced by sort + partition_point binary search (O(k log k)); max-count / any-cluster are multiset properties.
- PERF-5: merge_hyphenated_spans (docx/pptx/xlsx). Vec::remove(i+1) in a loop with no-advance-on-merge was O(n^2); rewritten as one forward pass with a running accumulator that chains exactly as before.
- PERF-7: extract_text_as_words / _with_custom_gaps (document.rs). to_chars() was materialized twice per span; materialize once and reuse.
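As an illustration of the PERF-5 rewrite, here is a minimal sketch of a single forward pass with a running accumulator. It works on plain strings and assumes a trailing `-` is the only merge criterion; the real `merge_hyphenated_spans` operates on layout spans and its merge conditions are not shown in this PR text.

```rust
/// Merge "foo-" + "bar" into "foobar" in one O(n) pass.
/// The last element of `out` acts as the running accumulator, so a chain
/// such as "ex-", "tra-", "ction" collapses without any Vec::remove shifts.
fn merge_hyphenated(spans: Vec<String>) -> Vec<String> {
    let mut out: Vec<String> = Vec::with_capacity(spans.len());
    for span in spans {
        if let Some(prev) = out.last_mut() {
            if prev.ends_with('-') {
                prev.pop(); // drop the trailing hyphen
                prev.push_str(&span); // may itself end in '-', so chains continue
                continue;
            }
        }
        out.push(span);
    }
    out
}
```

Because each input element is visited once and only ever appended to the tail, the pass avoids the element shifting that made the `Vec::remove(i+1)` loop quadratic.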

…order reorder_rowspan_labels promoted a numbered reference/figure-legend marker column ('1.', '2.', ...) as if it were a multi-row rowspan label, hoisting the markers out of reading order. Detect a vertical numbered list (>=3 markers in a tight left-edge cluster spread over >=3 rows) and exclude those markers from label promotion, keeping the legend adjacent to its figure title instead of scrambled into the body. Corpus: 155/156 byte-identical; the one change is a pure reorder (zero content loss) of a scanned figure-legend, keeping it with its caption. TDD: test_rowspan_skips_numbered_reference_continuation (and the existing genuine-rowspan promotion test still passes).
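The numbered-list exclusion could look roughly like the sketch below. The `Span` type, the marker pattern (digits followed by a dot), and both tolerances are assumptions; the commit only states the thresholds of at least 3 markers in a tight left-edge cluster spread over at least 3 rows. For simplicity the cluster is anchored at the leftmost marker.

```rust
// Hypothetical span: text, left edge, and baseline y.
struct Span { text: String, left: f32, y: f32 }

/// Matches "1.", "2.", "12." and similar.
fn is_numbered_marker(text: &str) -> bool {
    match text.trim().strip_suffix('.') {
        Some(digits) => !digits.is_empty() && digits.chars().all(|c| c.is_ascii_digit()),
        None => false,
    }
}

/// True when at least 3 numbered markers share a tight left edge and sit on
/// at least 3 distinct rows, i.e. the column is a list, not a rowspan label.
fn is_vertical_numbered_list(spans: &[Span], left_tol: f32, row_tol: f32) -> bool {
    let markers: Vec<&Span> = spans.iter().filter(|s| is_numbered_marker(&s.text)).collect();
    let min_left = markers.iter().map(|s| s.left).fold(f32::INFINITY, f32::min);
    let mut ys: Vec<f32> = markers
        .iter()
        .filter(|s| s.left - min_left <= left_tol)
        .map(|s| s.y)
        .collect();
    if ys.len() < 3 {
        return false;
    }
    ys.sort_by(|a, b| a.total_cmp(b));
    // Count distinct rows: a new row starts whenever the y gap exceeds row_tol.
    let mut rows = 1;
    for pair in ys.windows(2) {
        if pair[1] - pair[0] > row_tol {
            rows += 1;
        }
    }
    rows >= 3
}
```

A caller would skip label promotion for the marker spans whenever this returns true, which keeps each legend entry next to its own text.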

…search select_drop_cap_initials rescanned every span for each oversized initial to find the nearest body continuation (O(initials*spans)). Pre-sort span indices by left edge once and probe only the narrow candidate x-window [init_right - max_fs*0.5, init_right + max_fs*0.12] via partition_point; the exact per-candidate gap test is unchanged, so output is byte-identical. Completes the O(n^2) hotspot sweep (PERF-2, the drop-cap item deferred from 3352b2a). Corpus: 156/156 byte-identical on the regression sweep.
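The windowed probe can be sketched with `slice::partition_point`. The helper names and the flat `lefts` array are assumptions for illustration; the window bounds are the ones quoted in the commit message, and the exact per-candidate gap test that follows is omitted.

```rust
/// Build span indices sorted by left edge (done once per page).
fn sort_by_left(lefts: &[f32]) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..lefts.len()).collect();
    idx.sort_by(|&a, &b| lefts[a].total_cmp(&lefts[b]));
    idx
}

/// Candidate x-window for the body continuation of one oversized initial.
fn drop_cap_window(init_right: f32, max_fs: f32) -> (f32, f32) {
    (init_right - max_fs * 0.5, init_right + max_fs * 0.12)
}

/// Indices whose left edge lies in [lo, hi], found by two binary searches
/// over the pre-sorted index instead of a scan over every span.
fn candidates_in_window<'a>(
    lefts: &[f32],
    sorted_idx: &'a [usize],
    lo: f32,
    hi: f32,
) -> &'a [usize] {
    let start = sorted_idx.partition_point(|&i| lefts[i] < lo);
    let end = sorted_idx.partition_point(|&i| lefts[i] <= hi).max(start);
    &sorted_idx[start..end]
}
```

Only the spans returned by `candidates_in_window` need the exact gap test, so the cost per initial drops from a full scan to two binary searches plus the size of the window.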

…RTL Arabic Arabic producers that emit glyphs via /ReversedChars + per-glyph /ActualText (ISO 32000-1 §14.8.2.3.3 / §14.9.4) reposition glyphs out of advance-order. When should_merge glued adjacent Arabic glyph spans, to_chars() reconstructed each glyph by cumulative font advance from the span's left edge, discarding the producer's true positions: e.g. lam/alif of القهوة landed at advance-x 539/542 instead of their true 548/552, so the zero-width qaf (true x 543) sorted between them and the RTL visual-order pass emitted قالهوة. After merging, stretch the advance leading into the merged-in span's first glyph so to_chars() reconstructs it at span.bbox.x, keeping per-glyph positions truthful for merge_rtl_line_to_visual_span's ascending-x sort. Gated to Arabic so Latin/CJK stay byte-identical. Result: القهوة/استهلاكًا/شائعة now correct (matches pdfium, the reference that gets this fixture perfect; pymupdf scrambles it). wiki-cat-ar text CER 0.079->0.004, wJacc 0.600->0.975; arabic-structured body de-scrambled (CER 0.092). 156-PDF byte-sweep: zero changes (all 6 RTL fixtures byte-identical) — the adjust is ~0 whenever advance already equals the true gap, so only genuinely scrambled producers are touched.

…r columns reorder_column_major_with_bands buffered a bottom-left References block (below both columns) into the left-column partition, so it printed BEFORE the entire right column instead of last. Peel any block lying a full line-height below the opposite column's bottom into a trailing group emitted after both columns at its own y. Guarded: only fires when the opposite column has real content (>=2 spans) and the block clears its bottom by a line, so balanced 2-col bodies (columns ending at ~equal y) are byte-identical. academic-2col: References now reads LAST (was mid-document). 156-PDF byte-sweep: zero changes (arxiv/SF1199A/issue1905/KJF byte-identical).

…span-level split

sidebar_body_reading_order clustered spans into baseline LINES first, but a publisher-metadata sidebar (MDPI/Frontiers/PMC) shares baselines with the body (the narrow left metadata column interleaves with body lines by Y), so each line fused sidebar+body into one full-width band and the sidebar never separated. These PDFs also carry NO background tint to anchor the sidebar (confirmed: zero drawings behind it), so geometry alone is indistinguishable from a label:value form.

Classify per SPAN by the gutter instead of per line, and gate the reorder on a semantic anti-form discriminator: the sidebar column must carry >=2 distinct publisher-furniture labels (Citation/Received/Accepted/Published/Copyright/Licensee/Academic Editor/Creative Commons/…), furniture that never heads a form field or body column. Emit body (+ any true full-width band) top-to-bottom, then the sidebar last.

PMC8103263 extract_text: body now reads contiguously (title→abstract→intro→body) with all metadata furniture at the end (was interleaved baseline-by-baseline). 156-PDF byte-sweep: ZERO changes: the >=2-label gate fires on no corpus PDF, so forms (SF1199A), N-up spreads, and narrow-column pages are byte-identical.
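A minimal sketch of the anti-form discriminator follows. The label list contains only the labels named in the commit message (the source list is elided with "…"), and matching by case-sensitive prefix is an assumption; the function name is hypothetical.

```rust
use std::collections::HashSet;

// Only the labels named in the commit message; the real list is longer.
const FURNITURE_LABELS: &[&str] = &[
    "Citation",
    "Received",
    "Accepted",
    "Published",
    "Copyright",
    "Licensee",
    "Academic Editor",
    "Creative Commons",
];

/// Gate: the sidebar column must carry at least 2 DISTINCT furniture labels.
/// Repeating one label many times does not pass.
fn sidebar_has_publisher_furniture(sidebar_texts: &[&str]) -> bool {
    let mut seen: HashSet<&str> = HashSet::new();
    for text in sidebar_texts {
        for label in FURNITURE_LABELS {
            if text.trim_start().starts_with(*label) {
                seen.insert(*label);
            }
        }
    }
    seen.len() >= 2
}
```

Counting distinct labels rather than occurrences is what separates a publisher sidebar from a form that happens to repeat one field name down its left column.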

…arkdown + html The plain-text path orders via extract_spans (which applies sidebar_body_reading_order), but to_markdown_inner / to_html_inner re-derive reading order through the pipeline afterward, re-interleaving the metadata column. Apply the same sidebar gate on those paths and preserve its order (via preserve_input_order) on untagged pages, mirroring the two-column-prose handling. A trustworthy struct tree's mcid order still wins. PMC8103263 markdown/html: metadata sidebar (Citation/Received/Copyright/…) now reads after the body, matching the plain-text path. 156-PDF byte-sweep: zero changes (the >=2-furniture-label gate fires on no corpus PDF).

…dChars Arabic Producers that draw RTL glyphs individually under /ReversedChars (ISO 32000-1 §14.8.2.3.3) mark real word boundaries with EXPLICIT space glyphs and never encode them as inter-glyph gaps (confirmed: arabic-structured + wiki-cat-ar emit zero literal-space Tj strings; spaces are F5 <0003> glyphs). oxide's geometric space detector then inserted spurious spaces between cursively adjacent Arabic letters, shattering words (إسبريسو -> إس بر يسو). Track a per-page saw_reversed_chars flag (set on /ReversedChars BMC) and, on such pages, suppress a GEOMETRIC space between two Arabic letters — explicit space glyphs (whitespace-only spans) still segment words. Gated to ReversedChars pages so ordinary geometric-spaced Arabic producers are untouched. arabic-structured: إسبريسو/كابتشينو/قهوة مقطّرة/والمناطق now whole; text CER 0.092->0.073, wJacc 0.488->0.657. wiki-cat-ar held at CER 0.004 / wJacc 0.975. 156-PDF byte-sweep: zero changes (gate fires on no corpus PDF).

…l_order Records why the markdown/html RTL path mis-orders an interleaved zero-width glyph (sorts whole spans by x, cannot place a glyph intra-word) and points to the glyph-level text-path reconstruction as the fix, so a future dev finds the root cause without re-deriving it. Comment-only; no behavior change.

…Align

CCITT G4 fax images rendered blank because (1) /EncodedByteAlign (common in fax scanners) was parsed but never applied (the third-party fax crate has no such hook and its bit reader is private), so byte-padded rows decoded to garbage, and (2) on any decode failure the code returned an all-white buffer (mean=255, std=0) reported as success, masking the failure as a blank page.

Replace the fax-crate image path with an in-house ITU-T T.6 (Group 4) decoder (src/decoders/ccitt.rs): MSB-first bit reader, the verified 2D mode + Modified-Huffman run tables (generated from T.4, asserted prefix-free), the reference-line changing-element walk ported from the verified fax-crate logic, plus the two gaps the crate could not express:

- EncodedByteAlign: skip zero fill to the next byte boundary between rows (pdfium-guarded: a 1 in the fill region disables alignment rather than corrupting every subsequent row).
- Partial recovery: keep the rows decoded before a truncation/damage and white-pad the tail, instead of discarding the whole page. Zero decodable rows returns Err.

decompress_ccitt routes to the in-house decoder first, falls back to the fax crate (Group 3, or G4 streams it declines), and only as a last resort emits a LOUDLY-WARNED blank, never a silent one.

Validated against a pymupdf/pdfium oracle: byte-aligned G4 (with /Rows, no /Rows, tall pages, /BlackIs1) now decodes pixel-identical to the oracle (was corrupt/blank); a truncated stream recovers partial content (was 255/0 blank); real CCITT fixtures (issue_395 9pp, 538250) decode byte-identical to the prior fax path (no regression, no crash). Unit tests cover V0/Horizontal/byte-align/zero-rows + table prefix-freeness.
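The byte-alignment step can be sketched with a small MSB-first bit reader. The `BitReader` type and its method names are hypothetical and do not reflect the layout of src/decoders/ccitt.rs; the sketch only shows the guarded skip: advance to the next byte boundary when the fill bits are all zero, and report failure without moving when a 1 bit is found.

```rust
/// Hypothetical MSB-first bit reader over an encoded CCITT stream.
struct BitReader<'a> {
    data: &'a [u8],
    pos: usize, // absolute bit position
}

impl<'a> BitReader<'a> {
    fn new(data: &'a [u8]) -> Self {
        BitReader { data, pos: 0 }
    }

    /// Read one bit, most significant bit of each byte first.
    fn read_bit(&mut self) -> Option<u8> {
        let byte = *self.data.get(self.pos / 8)?;
        let bit = (byte >> (7 - (self.pos % 8))) & 1;
        self.pos += 1;
        Some(bit)
    }

    /// Skip zero fill up to the next byte boundary (called between rows).
    /// Returns false and leaves `pos` untouched if the fill region holds a
    /// 1 bit, so the caller can disable alignment instead of desyncing.
    fn align_to_byte(&mut self) -> bool {
        let rem = self.pos % 8;
        if rem == 0 {
            return true; // already aligned
        }
        match self.data.get(self.pos / 8) {
            Some(&byte) => {
                let fill_mask = 0xFFu8 >> rem; // the not-yet-consumed low bits
                if byte & fill_mask != 0 {
                    return false;
                }
                self.pos += 8 - rem;
                true
            }
            None => true, // at end of data there is nothing to skip
        }
    }
}
```

In this sketch a decoder would call `align_to_byte` after each completed row when `/EncodedByteAlign` is true, and stop aligning for the rest of the image the first time it returns false.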

The RW-1e clip (ISO 32000-1 §8.10.1) dropped text a Form XObject paints outside its /BBox. That is correct for a figure whose embedded source PDF retained a draft-galley underlay, but wrong for a full-page content-frame wrapper whose declared BBox happens to exclude real body text: a conformant renderer clips it, yet every text extractor (poppler/pdftotext, the common reference) keeps it — and for a wrapper it is the body's only copy. Discriminate by coverage: a figure occupies a sub-region of the page; a wrapper covers most of it. Only apply the clip when the mapped clip rect is figure-sized (< 60% of page MediaBox area). Measured figures cover <=27% of the page; the regressing wrapper covered 82%. The galley-dedup win is kept (figure forms still clip) while page-wrapper bodies are preserved. Found in the v0.3.64->v0.3.65 release regression sweep (301 PDFs): a full-page-wrapper page lost its body (2726->115 bytes). Post-fix the page is byte-identical to v0.3.64 and the dedup fixture is unchanged; the rest of the 301-PDF sweep is byte-identical to the pre-fix candidate (zero collateral).
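The coverage discriminator reduces to a single area comparison. The sketch below assumes the clip rect has already been mapped into page space; the `Rect` type and function name are hypothetical, while the 60% threshold is the one stated above.

```rust
#[derive(Clone, Copy)]
struct Rect { x0: f32, y0: f32, x1: f32, y1: f32 }

impl Rect {
    fn area(&self) -> f32 {
        (self.x1 - self.x0).max(0.0) * (self.y1 - self.y0).max(0.0)
    }
}

/// Apply the Form XObject /BBox clip only when the mapped clip rect is
/// figure-sized, i.e. covers less than 60% of the page MediaBox area.
fn should_apply_bbox_clip(mapped_clip: Rect, media_box: Rect) -> bool {
    let page_area = media_box.area();
    page_area > 0.0 && mapped_clip.area() < 0.60 * page_area
}
```

With the coverages measured in the sweep, a figure at 27% of the page is still clipped, while the 82% full-page wrapper keeps its body text.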

CHANGELOG 0.3.65 (2026-06-16): multilingual & layout extraction quality — RTL Arabic/Hebrew bidi reconstruction, multi-region reading order (publisher sidebars, two-column academic), CJK/Indic word segmentation — plus an in-house CCITT Group 4 fax decoder (#738), two-column structured surfacing (#734), reading-order threads (#458), and O(n^2) hot-path removals. Bumps 0.3.64 -> 0.3.65 in every binding manifest/version/parity test (Cargo workspace cli/mcp/jni, Cargo.lock, pyproject already at 0.3.65, js/wasm package.json, java pom, ruby, php, csharp, go, python/rust parity tests) and the README Maven example.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Release v0.3.65 focuses on improving multilingual extraction quality (RTL bidi, CJK/Indic segmentation), layout-aware reading order (two-column + sidebars + threads), and performance, alongside a new in-house CCITT Group 4 decoder and version parity bumps across bindings.

Changes:

Adds/extends extraction features: structured two-column regions (column_index), tagged structure surfacing (Lbl→role, Sect/Art/Part→section_id), article-thread reading order, and Form /BBox clipping.
Improves RTL and word segmentation: digit-preserving RTL reversal, Arabic joining-type handling, Hangul mid-word wrap rejoin, Indic punctuation hugging.
Removes hot-path O(n²) patterns and bumps versions/docs/tests to 0.3.65.

Reviewed changes

Copilot reviewed 53 out of 55 changed files in this pull request and generated no comments.


File	Description
wasm-pkg/package.json	Bumps wasm package version to 0.3.65.
tests/v0365_targets_and_locks.rs	Adds regression targets/locks for reading order + RTL behavior.
tests/seg_ko_functional.rs	Adds end-to-end Korean segmentation regression tests.
tests/seg_indic_functional.rs	Adds end-to-end Indic punctuation hugging regression tests.
tests/rw1e_form_bbox_clip.rs	Adds regression tests for Form XObject `/BBox` clipping in extraction.
tests/issue_734_two_column_text_order.rs	Adds integration tests for two-column reading order + structured regions.
tests/issue_734_tagged_structure.rs	Adds tests for tagged structure surfacing (`Lbl`, section continuity).
tests/issue_458_article_threads.rs	Adds tests for parsing/applying `/Threads` bead ordering.
tests/full_width_header_columns_md.rs	Updates prior ignore/commentary to reflect #734 fix being in place.
tests/core_parity.rs	Updates version-parity assertion to 0.3.65.
src/text/rtl_detector.rs	Adds Arabic right-joining letter classifier + tests.
src/text/bidi.rs	Adds digit-preserving RTL reversal helper + tests.
src/structured.rs	Adds `section_id`, column mode override API, and improved gutter detection/grouping.
src/structure/traversal.rs	Threads `section_id` through structure traversal output.
src/pipeline/reading_order/xycut.rs	Replaces O(k²) clustering logic with sort + window counting.
src/pipeline/reading_order/mod.rs	Adds `preserve_input_order` to avoid re-sorting already-reordered spans.
src/pipeline/page_order.rs	Wires article-thread bead ordering into default reading-order path with gates.
src/pipeline/mod.rs	Implements preserve-input-order fast path in the text pipeline.
src/lib.rs	Exports `ColumnMode` in the public API.
src/extractors/text.rs	Improves RTL handling, fixes Tw application for multibyte codes, adds Form `/BBox` clip logic.
src/extractors/ccitt_bilevel.rs	Uses in-house CCITT G4 decoder with fallback + changes bilevel helper visibility.
src/document.rs	Adds two-column prose reorder, sidebar-body reorder, RTL fixes, segmentation fixes, and perf wins.
src/decoders/mod.rs	Exposes `ccitt` module internally for decoder use.
src/decoders/ccitt.rs	Replaces pass-through-only CCITT module docs with in-house G4 decoder + tests.
src/converters/xlsx_layout.rs	Removes O(n²) hyphen-merge by using single-pass accumulator.
src/converters/pptx_layout.rs	Removes O(n²) hyphen-merge by using single-pass accumulator.
src/converters/pdf_to_ir.rs	Avoids quadratic rotated-char filtering when no rotated chars exist.
src/converters/docx_layout.rs	Removes O(n²) hyphen-merge by using single-pass accumulator.
ruby/spec/core_parity_spec.rb	Updates Ruby binding version parity check to 0.3.65.
ruby/spec/cdylib_smoke_spec.rb	Updates Ruby cdylib smoke version check to 0.3.65.
ruby/lib/pdf_oxide/version.rb	Bumps Ruby gem version constant to 0.3.65.
python/tests/test_core_parity.py	Updates Python binding version parity check to 0.3.65.
pyproject.toml	Bumps Python package version to 0.3.65.
php/tests/Integration/CoreParityTest.php	Updates PHP binding version parity check to 0.3.65.
php/src/Pdf.php	Bumps PHP version constant to 0.3.65.
php/scripts/download-native-lib.php	Bumps PHP installer defaults/user-agent to v0.3.65.
pdf_oxide_mcp/src/protocol.rs	Adds MCP schema option `column_mode` for structured extraction.
pdf_oxide_mcp/src/extract.rs	Implements MCP `column_mode` validation and passes it to structured extraction.
pdf_oxide_mcp/Cargo.toml	Bumps MCP crate + dependency version to 0.3.65.
pdf_oxide_jni/Cargo.toml	Bumps JNI crate + dependency version to 0.3.65.
pdf_oxide_cli/tests/structured_format.rs	Adds CLI tests for `--column-mode` override semantics.
pdf_oxide_cli/src/cli/repl.rs	Plumbs default `column_mode=auto` through REPL text command.
pdf_oxide_cli/src/cli/mod.rs	Plumbs `column_mode` through CLI dispatch and stdin path.
pdf_oxide_cli/src/cli/commands/text.rs	Implements `--column-mode` mapping for structured extraction.
pdf_oxide_cli/src/cli/args.rs	Adds `--column-mode` CLI option for structured format.
pdf_oxide_cli/Cargo.toml	Bumps CLI crate + dependency version to 0.3.65.
js/package.json	Bumps JS package version to 0.3.65.
java/pom.xml	Bumps Java binding version/tag to 0.3.65.
go/cmd/install/main.go	Bumps Go installer fallback version to 0.3.65.
csharp/PdfOxide/PdfOxide.csproj	Bumps NuGet package version to 0.3.65.
README.md	Updates Maven dependency snippet version to 0.3.65.
Cargo.toml	Bumps core crate version to 0.3.65.
CHANGELOG.md	Adds v0.3.65 changelog entry detailing features/fixes/perf.

Comments suppressed due to low confidence (6)

src/extractors/ccitt_bilevel.rs:1

The blank-image fallback size is now derived from params.rows, but params.rows can be None even when the caller provided height_opt (e.g., image dictionary has Height but DecodeParms omits Rows). This can under-allocate the output (often to 1 row), producing an invalid/incorrect image buffer. Use height_opt as the primary fallback height (as before), and only fall back to params.rows (or vice versa) in a way that preserves the known image height.
src/pipeline/page_order.rs:1
parse_article_threads(doc) is invoked inside page_article_bead_rects, which is called per page in the default reading-order path. If thread parsing walks a significant portion of the document structure, this can become an avoidable O(pages × parse_cost) overhead. Consider caching parsed threads at the PdfDocument level (or in a per-document lazy cache) and reusing them across pages, then filtering to the current page's beads.
src/pipeline/page_order.rs:1
The order-divergence gate compares only x and y when checking whether bead order matches geometric order. If two beads share the same (x, y) but differ in width/height (or if floating-point representation differs slightly), the gate can incorrectly treat distinct beads as equal and suppress thread application. Compare the full rectangle identity (x, y, width, height) with exact or tolerant comparison, or use a stable bead identifier if available.
src/document.rs:1
Switching prev_span from Option<&TextSpan> to Option<TextSpan> forces a full TextSpan clone for every emitted span. TextSpan can carry sizable allocations (e.g., text, char_widths), so this can materially increase CPU and memory pressure on large pages. If the borrow-checker constraint is the need to persist prev_span across MCID iterations where spans may come from a temporary vector, consider storing only the fields needed for spacing/line-break decisions (bbox, font_size, and a small amount of text-derived info), or cloning a lightweight struct rather than the full span.
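The lightweight-struct alternative suggested in the prev_span comment could be sketched as below. All type and field names here are hypothetical stand-ins, not the crate's actual `TextSpan` layout, and which fields the spacing and line-break decisions need is an assumption.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct BBox { x: f32, y: f32, width: f32, height: f32 }

// Stand-in for the real span, which owns heap allocations.
struct TextSpan {
    text: String,
    bbox: BBox,
    font_size: f32,
    char_widths: Vec<f32>,
}

/// Copyable summary of the previous span: no String or Vec is cloned.
#[derive(Clone, Copy, Debug, PartialEq)]
struct PrevSpanInfo {
    bbox: BBox,
    font_size: f32,
    last_char: Option<char>,
}

impl From<&TextSpan> for PrevSpanInfo {
    fn from(span: &TextSpan) -> Self {
        PrevSpanInfo {
            bbox: span.bbox,
            font_size: span.font_size,
            last_char: span.text.chars().last(),
        }
    }
}
```

Holding an `Option<PrevSpanInfo>` across MCID iterations sidesteps the lifetime issue the same way a full clone does, at the cost of a few copied scalars per span.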


The v0.3.65 batch landed without a full local CI pass (shared-box build contention); this clears every failure the PR surfaced.

- rustfmt: format ccitt.rs, document.rs, ccitt_bilevel.rs, text.rs, rtl_detector.rs, and the v0365 lock test (committed unformatted).
- clippy (-D warnings): manual_range_contains, manual_clamp, useless_conversion, useless_vec (document.rs); manual_is_multiple_of (ccitt.rs); doc_lazy_continuation x2 (structured.rs); unnecessary_map_or -> is_some_and (text.rs RW-1e gate).
- rw1_full_width_title_reads_contiguously: the D3 sidebar-segregation rewrite added an anti-false-positive gate requiring >=2 DISTINCT furniture labels, but the fixture's sidebar was 8x 'Citation' (one distinct label) and the title words were spread with gaps that split at the gutter, so sidebar_body_reading_order never engaged. Make the fixture realistic: a single full-width title run plus eight distinct furniture labels (Citation/Received/Accepted/...), keeping the span count above the classifier's 30-span floor. Locks the real behaviour; zero production code change.

cargo doc --no-deps -D warnings (stable) rejected two intra-doc links from public items to non-public targets:

- CcittFaxDecoder doc linked [`decode`] (a pub(crate) fn): private-intra-doc-links
- ColumnMode doc linked [`build_structured_page_with_mode`] (pub(crate) fn): broken-intra-doc-links (out of public doc scope)

Demote both to plain code spans; they were descriptive, not navigational.

yfedoseev added 17 commits June 12, 2026 23:38

release: bump version to 0.3.65

b9a2dbd

yfedoseev requested a review from Copilot June 16, 2026 03:55

Copilot AI reviewed Jun 16, 2026


Copilot started reviewing on behalf of yfedoseev June 16, 2026 04:25

yfedoseev added 2 commits June 15, 2026 21:32

yfedoseev merged commit fd16e75 into main Jun 16, 2026
241 checks passed

yfedoseev deleted the release/v0.3.65 branch June 16, 2026 07:43

yfedoseev merged 19 commits into main from release/v0.3.65
