
release: v0.3.65 — multilingual & layout extraction quality #739

Merged: yfedoseev merged 19 commits into main from release/v0.3.65 on Jun 16, 2026

@yfedoseev

v0.3.65 — Multilingual & layout extraction quality

Right-to-left bidi reconstruction (Arabic/Hebrew), multi-region reading order (publisher sidebars + two-column academic), CJK/Indic word segmentation, an in-house CCITT Group 4 fax decoder, structured two-column surfacing, and a batch of O(n²) hot-path removals.

Added

  • Two-column structured extraction + tagged-structure surfacing — per-line column_index, Lbl → marginal label, Sect/Art/Part → section_id (ISO 32000-1 §14.8.4).
  • Reading-order threads (/Threads, /B bead ordering) for content that flows across columns/pages.
  • In-house CCITT Group 4 (T.6) fax decoder — honours EncodedByteAlign, partial-row recovery; replaces a silent all-white fallback.

Fixed

  • RTL Arabic/Hebrew logical-order reconstruction — cross-span cluster reversal (الثدييات), RTL number preservation (٤٣٤١١٤٣٤, ל ,2009-ל-2009,), glyph-advance preservation on /ReversedChars producers.
  • Multi-region reading order — publisher-sidebar segregation across text/md/html, bottom-spanning block peel, rowspan-label fix, two-column prose linearisation, figure /BBox clip (gated to figure-sized forms so full-page wrappers keep their body).
  • CJK/Indic word segmentation — Korean number/counter spacing + line-break rejoin, stray spaces before Bengali/Devanagari/Latin punctuation, Adobe predefined CIDFont CID→Unicode (§9.3.3).

Changed

  • Performance — O(n²)/O(n·m) hot-path removals (drop-cap pairing, rotated-char filter, table filter, XY-cut, hyphen merge, word extraction); output unchanged.
  • Redundant clip-mask clone dropped in apply_pending_clip — sub-perceptual pixel change.

Verification

  • v0.3.64→v0.3.65 release regression sweep: 301 PDFs, 0 regressions (all diffs are improvements or benign laterals); google_doc_document.pdf table GUARD byte-identical.
  • Version bumped 0.3.64→0.3.65 across all binding manifests + parity tests.

Contributors

Thanks @RayVR (#654), @potatochipcoconut (#738), @lggcs (#734).


Closes #738
Closes #734
Closes #458
Ref: #654

yfedoseev added 17 commits June 12, 2026 23:38
…cing; #458 threads

Corpus-validated (v0.3.64 vs HEAD, 156-PDF sweep). Ships the parts that are
spec-correct and byte-identical-safe; the automatic text/md/html column-major
reorder is REVERTED — the sweep proved the per-flow heuristics regress real
two-column PDFs (arxiv references scrambled, a road data-sheet form glued, a
TOC's page numbers detached), the same wall the in-partition band-peel hit 3×.
text/md/html output is byte-identical to v0.3.64 (verified on the three
regressors). The correct fix is a single recursive XY-cut column-region pass
(`page_reading_order`) consumed by ALL flows; tracked as a follow-up and gated
on the corpus (KJF text/md/html targets are #[ignore]d until it lands).

Shipped (corpus-safe / additive):
* structured `extract_structured`: populate `column_index` for two-column
  bodies (gutter-bridge-aware detector) and group one region per column in
  column-major order (#734 §1/§2); best-effort, structured-only.
* tagged structure surfacing (§14.8.4): `Lbl` → MarginalLabel (#734 §4); nearest
  `Sect`/`Art`/`Part` → document-stable `section_id` on each region, giving
  cross-page chapter continuity (#734 §5) and spillover grouping (#734 §6).
  Additive — empty for untagged/suspect PDFs, so their output is unchanged.
  traversal.rs tracks the nearest Sect/Art/Part ancestor per MCID.
* ColumnMode {Auto,Two,Single} via MCP `column_mode` + CLI `--column-mode`.
* #458 article threads: page_article_bead_rects gate (≥2 beads ∧ ≥80% coverage
  ∧ ≥2 x-bands ∧ order-divergence) wired into page_reading_order, gated on
  !has_structure_tree; order-divergence keeps single-column threads identical.

Reverted (corpus-regressing): per-flow column-major in assemble_text_from_spans,
to_markdown/to_html (split-convert), and the gutter/cmp/split helpers.

Local guard battery + corpus diff: text/md/html byte-identical to v0.3.64 on the
3 regressors; structured/tagged/threads tests green.
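
The article-thread gate listed above can be read as a plain conjunction of its conditions. The following is a minimal sketch under stated assumptions, not the code in `page_article_bead_rects`: the `ThreadStats` struct, its field names, and the reading of "coverage" as a 0..1 fraction are all hypothetical.

```rust
/// Hypothetical per-page summary of an article thread.
/// Field names are illustrative and are not taken from pdf_oxide.
#[derive(Debug, Clone, Copy)]
pub struct ThreadStats {
    /// Number of beads the thread places on this page.
    pub bead_count: usize,
    /// Assumed meaning: fraction (0.0..=1.0) of page text covered by bead rects.
    pub coverage: f32,
    /// Number of distinct horizontal bands the beads fall into.
    pub x_bands: usize,
    /// True when bead order differs from plain geometric order.
    pub order_diverges: bool,
}

/// Applies the four thresholds from the commit message, plus the
/// `!has_structure_tree` outer gate.
pub fn should_apply_thread_order(stats: ThreadStats, has_structure_tree: bool) -> bool {
    !has_structure_tree
        && stats.bead_count >= 2
        && stats.coverage >= 0.80
        && stats.x_bands >= 2
        && stats.order_diverges
}
```

Because `order_diverges` is part of the conjunction, a single-column thread whose beads already follow geometric order never changes the output.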
…i, reading order, word-seg

Corpus-verified 0-regression (156-PDF byte-sweep + per-doc CER vs pdf_benches golds):

- RTL Arabic/Hebrew: number-run preservation on visual->logical reversal (UAX#9 L2,
  reverse_rtl_keep_numbers); right-joining joining-type discriminator for interior-space
  strip; density-gated de-shatter of glyph-exploded words; and the cross-span GLYPH
  INTERLEAVE repair (merge_interleaved_rtl_lines): producer-drawn zero-width mark/consonant
  spans whose x falls inside a body span are re-collapsed per visual line using the
  producer's standalone-space word boundaries, fixing al-thadyiyaat-class scrambles.
  wiki-cat-ar CER 0.134 -> 0.079; gated so already-correct RTL (BidiSample, ArabicCIDTrueType,
  hebrew_mirrored: zero width-0 spans) is byte-identical.
- RW-1 multi-region reading order: sidebar_body_reading_order band-first emit for
  narrow-sidebar+wide-body pages (PMC title no longer shattered along the body gutter).
- Two-column reading order (attempt #5), Korean over-seg, Bengali/Hindi clause-punct hug,
  CID-Tw single-byte gating (§9.3.3).

Hand-built in-code fixtures (no third-party PDFs); RTL/reading-order guard battery green.
… word-boundary fixes

RW-1e (real-academic): honor the ISO 32000-1 §8.10.1 Form XObject /BBox clip in
process_xobject — drop spans the form paints OUTSIDE its /BBox. pdfTeX figure
forms embed a 'FOR PEER REVIEW' draft-galley page outside the figure BBox that a
conformant renderer clips; pdf_oxide was emitting it as duplicate body text.
PMC8103263 CER 0.510->0.155 (pymupdf parity), dup_ratio 0.254->0.025. Allocation-
free fast path keeps form-heavy docs fast. Matches pymupdf's verified behaviour.

SEG-HE: number-preserving RTL visual->logical reversal of a neutral+single-digit
span in a pure-RTL run (' ,2009-' -> '-2009, '); wiki-cat-he wJacc 0.873->1.000.
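
A number-preserving reversal of this kind might look like the sketch below. It is an illustration only: `reverse_keep_digit_runs` is a hypothetical name, not the crate's `reverse_rtl_keep_numbers`, and it handles only bare digit runs (ASCII and Arabic-Indic), not the full UAX#9 number rules.

```rust
/// ASCII digits plus Arabic-Indic digits (U+0660..=U+0669).
fn is_digit(c: char) -> bool {
    c.is_ascii_digit() || ('\u{0660}'..='\u{0669}').contains(&c)
}

/// Reverse a visual-order RTL run into logical order, then restore each
/// maximal digit run to its original left-to-right order.
pub fn reverse_keep_digit_runs(visual: &str) -> String {
    let mut chars: Vec<char> = visual.chars().rev().collect();
    let mut i = 0;
    while i < chars.len() {
        if is_digit(chars[i]) {
            let start = i;
            while i < chars.len() && is_digit(chars[i]) {
                i += 1;
            }
            // The whole-string reversal flipped this number; flip it back.
            chars[start..i].reverse();
        } else {
            i += 1;
        }
    }
    chars.into_iter().collect()
}
```

On the example from this commit, `reverse_keep_digit_runs(" ,2009-")` yields `"-2009, "`.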

SEG-AR: a space bordering non-cursive clause punctuation is a real word break, not
a cursive shatter space; wiki-cat-ar wJacc 0.574->0.600.

156-PDF byte-sweep: all RTL fixtures (BidiSample/ArabicCIDTrueType/hebrew_mirrored/
PDFBOX-4531/issue10301/issue18117) byte-identical; only deterministic corpus change
is one benign same-length arxiv reorder. Hand-built in-code fixtures.
… pages

span_overlaps_rotated_chars drops a span only when its nearest char is rotated
(>=5deg); on a page with NO rotated char (the overwhelming majority) the per-span
nearest-char scan ran O(spans*chars) only to never drop anything. Gate the whole
retain behind one O(chars) precheck — byte-identical output, removes the quadratic
on every unrotated page of the PDF->IR (docx/pptx/xlsx/round-trip) path.
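
The shape of that gate, sketched with stand-in types (`Ch`, `Span`, and the function names below are hypothetical, and the nearest-char test is simplified to a point distance):

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Ch { pub x: f32, pub y: f32, pub rotation_deg: f32 }

#[derive(Debug, Clone, PartialEq)]
pub struct Span { pub x: f32, pub y: f32, pub text: String }

const ROTATED_MIN_DEG: f32 = 5.0;

fn is_rotated(c: &Ch) -> bool {
    c.rotation_deg.abs() >= ROTATED_MIN_DEG
}

fn dist2(s: &Span, c: &Ch) -> f32 {
    (s.x - c.x).powi(2) + (s.y - c.y).powi(2)
}

/// O(chars) per span: find the nearest char and check its rotation.
fn nearest_char_is_rotated(span: &Span, chars: &[Ch]) -> bool {
    chars
        .iter()
        .min_by(|a, b| dist2(span, a).total_cmp(&dist2(span, b)))
        .map_or(false, is_rotated)
}

pub fn drop_spans_over_rotated_chars(spans: &mut Vec<Span>, chars: &[Ch]) {
    // One O(chars) precheck: with no rotated char, the retain below could
    // never drop anything, so skip the O(spans * chars) scan entirely.
    if !chars.iter().any(is_rotated) {
        return;
    }
    spans.retain(|s| !nearest_char_is_rotated(s, chars));
}
```

The early return cannot change the result, because the per-span predicate only returns true when some char is rotated.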
…rge, word extraction

All byte-identical (the 156-PDF byte-sweep shows ZERO output change vs the prior
binary; only the pre-existing RW-1e arxiv reorder differs).

- PERF-1: untagged table-ownership filter (document.rs) — per-span containment
  test was O(spans*cells) (~2e7 AABB tests on dense table pages). Index cell
  bboxes into coarse y-bands once; a span only probes cells in its y-band. A
  containing cell always shares the span's y-band, so identical to the full scan.
- PERF-6: is_single_column_region (xycut.rs) — two O(k^2) within-tolerance
  cluster scans over per-region lines/gaps replaced by sort + partition_point
  binary search (O(k log k)); max-count / any-cluster are multiset properties.
- PERF-5: merge_hyphenated_spans (docx/pptx/xlsx) — Vec::remove(i+1) in a loop
  with no-advance-on-merge was O(n^2); rewritten as one forward pass with a
  running accumulator that chains exactly as before.
- PERF-7: extract_text_as_words / _with_custom_gaps (document.rs) — to_chars()
  was materialized twice per span; materialize once and reuse.
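
As an illustration of the PERF-5 rewrite, here is a minimal single-pass hyphen merge over plain strings. It is a sketch, not the converter code: `merge_hyphenated` is a hypothetical name, and the real spans carry geometry that this version ignores.

```rust
/// Merge "inter-" + "national" style pairs in one forward pass.
/// The last element of `out` acts as the running accumulator, so a chain
/// of hyphenated fragments keeps merging into the same entry.
pub fn merge_hyphenated(spans: Vec<String>) -> Vec<String> {
    let mut out: Vec<String> = Vec::with_capacity(spans.len());
    for s in spans {
        let merge = out.last().map_or(false, |prev| prev.ends_with('-'));
        if merge {
            let prev = out.last_mut().expect("checked non-empty above");
            prev.pop(); // drop the trailing hyphen
            prev.push_str(&s);
        } else {
            out.push(s);
        }
    }
    out
}
```

Each input element is visited once and nothing is removed from the middle of a vector, which is what avoids the quadratic cost of `Vec::remove(i+1)` in a loop.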
…order

reorder_rowspan_labels promoted a numbered reference/figure-legend
marker column ('1.', '2.', ...) as if it were a multi-row rowspan
label, hoisting the markers out of reading order. Detect a vertical
numbered list (>=3 markers in a tight left-edge cluster spread over
>=3 rows) and exclude those markers from label promotion, keeping the
legend adjacent to its figure title instead of scrambled into the body.

Corpus: 155/156 byte-identical; the one change is a pure reorder (zero
content loss) of a scanned figure-legend, keeping it with its caption.
TDD: test_rowspan_skips_numbered_reference_continuation (and the
existing genuine-rowspan promotion test still passes).
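
The detection rule described in this commit could be sketched as follows. The `Marker` type, the marker syntax check, and the `x_tol` tolerance parameter are assumptions made for the example; only the two >=3 thresholds come from the commit message.

```rust
/// Hypothetical candidate label: its text, left edge, and row index.
#[derive(Debug, Clone, Copy)]
pub struct Marker<'a> {
    pub text: &'a str,
    pub left: f32,
    pub row: usize,
}

/// True for "1.", "2.", "17." and similar (digits followed by a dot).
fn is_numbered_marker(text: &str) -> bool {
    match text.trim().strip_suffix('.') {
        Some(digits) => !digits.is_empty() && digits.chars().all(|c| c.is_ascii_digit()),
        None => false,
    }
}

/// A vertical numbered list: >=3 numbered markers whose left edges sit
/// within `x_tol` of each other and which span >=3 distinct rows.
pub fn is_vertical_numbered_list(markers: &[Marker<'_>], x_tol: f32) -> bool {
    let numbered: Vec<&Marker<'_>> =
        markers.iter().filter(|m| is_numbered_marker(m.text)).collect();
    if numbered.len() < 3 {
        return false;
    }
    let min_left = numbered.iter().map(|m| m.left).fold(f32::INFINITY, f32::min);
    let max_left = numbered.iter().map(|m| m.left).fold(f32::NEG_INFINITY, f32::max);
    if max_left - min_left > x_tol {
        return false;
    }
    let mut rows: Vec<usize> = numbered.iter().map(|m| m.row).collect();
    rows.sort_unstable();
    rows.dedup();
    rows.len() >= 3
}
```

Markers that pass this check would be excluded from rowspan-label promotion, while a genuine multi-row label (non-numeric text) is unaffected.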
…search

select_drop_cap_initials rescanned every span for each oversized
initial to find the nearest body continuation (O(initials*spans)).
Pre-sort span indices by left edge once and probe only the narrow
candidate x-window [init_right - max_fs*0.5, init_right + max_fs*0.12]
via partition_point; the exact per-candidate gap test is unchanged, so
output is byte-identical. Completes the O(n^2) hotspot sweep (PERF-2,
the drop-cap item deferred from 3352b2a).

Corpus: 156/156 byte-identical on the regression sweep.
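
The window probe can be illustrated with `slice::partition_point` on a pre-sorted list of left edges. This is a sketch: `candidate_window` is a hypothetical helper, and the real code sorts span indices rather than raw coordinates.

```rust
use std::ops::Range;

/// Given span left edges sorted ascending, return the index range whose
/// left edge lies in [init_right - max_fs*0.5, init_right + max_fs*0.12].
pub fn candidate_window(sorted_lefts: &[f32], init_right: f32, max_fs: f32) -> Range<usize> {
    let lo = init_right - max_fs * 0.5;
    let hi = init_right + max_fs * 0.12;
    // First index with left >= lo.
    let start = sorted_lefts.partition_point(|&x| x < lo);
    // First index with left > hi.
    let end = sorted_lefts.partition_point(|&x| x <= hi);
    start..end
}
```

Only the spans inside the returned range need the exact per-candidate gap test, so each initial costs two binary searches plus a short scan instead of a pass over every span.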
…RTL Arabic

Arabic producers that emit glyphs via /ReversedChars + per-glyph
/ActualText (ISO 32000-1 §14.8.2.3.3 / §14.9.4) reposition glyphs out
of advance-order. When should_merge glued adjacent Arabic glyph spans,
to_chars() reconstructed each glyph by cumulative font advance from the
span's left edge, discarding the producer's true positions: e.g. lam/alif
of القهوة landed at advance-x 539/542 instead of their true 548/552, so
the zero-width qaf (true x 543) sorted between them and the RTL
visual-order pass emitted قالهوة.

After merging, stretch the advance leading into the merged-in span's first
glyph so to_chars() reconstructs it at span.bbox.x, keeping per-glyph
positions truthful for merge_rtl_line_to_visual_span's ascending-x sort.
Gated to Arabic so Latin/CJK stay byte-identical.

Result: القهوة/استهلاكًا/شائعة now correct (matches pdfium, the reference
that gets this fixture perfect; pymupdf scrambles it). wiki-cat-ar text
CER 0.079->0.004, wJacc 0.600->0.975; arabic-structured body de-scrambled
(CER 0.092). 156-PDF byte-sweep: zero changes (all 6 RTL fixtures
byte-identical) — the adjust is ~0 whenever advance already equals the
true gap, so only genuinely scrambled producers are touched.
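
The adjustment can be pictured with plain numbers. The sketch below assumes glyph positions are rebuilt by cumulative advance from the span's left edge; `stretch_last_advance` and `glyph_origins` are hypothetical helpers, not functions from the crate.

```rust
/// Rebuild glyph origins by cumulative advance from the span's left edge.
/// Returns one origin per glyph plus the origin of whatever follows.
pub fn glyph_origins(left_x: f32, advances: &[f32]) -> Vec<f32> {
    let mut x = left_x;
    let mut origins = vec![x];
    for a in advances {
        x += a;
        origins.push(x);
    }
    origins
}

/// Stretch the final advance so the next glyph is reconstructed at
/// `next_span_x`, the merged-in span's true left edge.
pub fn stretch_last_advance(left_x: f32, advances: &mut [f32], next_span_x: f32) {
    let end_x = left_x + advances.iter().sum::<f32>();
    if let Some(last) = advances.last_mut() {
        // The adjustment is ~0 when the advance already matches the true gap.
        *last += next_span_x - end_x;
    }
}
```

With a left edge of 530 and advances `[5, 4]`, the glyph after the span is reconstructed at 539; stretching toward a true position of 548 changes the advances to `[5, 13]`, so the cumulative walk lands on 548.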
…r columns

reorder_column_major_with_bands buffered a bottom-left References block
(below both columns) into the left-column partition, so it printed BEFORE
the entire right column instead of last. Peel any block lying a full
line-height below the opposite column's bottom into a trailing group
emitted after both columns at its own y. Guarded: only fires when the
opposite column has real content (>=2 spans) and the block clears its
bottom by a line, so balanced 2-col bodies (columns ending at ~equal y)
are byte-identical.

academic-2col: References now reads LAST (was mid-document). 156-PDF
byte-sweep: zero changes (arxiv/SF1199A/issue1905/KJF byte-identical).
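
A minimal sketch of the peel rule follows. It assumes a top-down y axis (larger y is lower on the page), which may differ from the crate's coordinate convention, and `Block` and `peel_trailing` are hypothetical names.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Block {
    pub top: f32,
    pub text: String,
}

/// Split one column's blocks into (kept, trailing). A block is peeled into
/// the trailing group when it starts at least one line-height below the
/// opposite column's bottom, and only if that column has real content.
pub fn peel_trailing(
    column: Vec<Block>,
    opposite_bottom: f32,
    opposite_span_count: usize,
    line_height: f32,
) -> (Vec<Block>, Vec<Block>) {
    if opposite_span_count < 2 {
        return (column, Vec::new());
    }
    let (trailing, kept): (Vec<Block>, Vec<Block>) = column
        .into_iter()
        .partition(|b| b.top >= opposite_bottom + line_height);
    (kept, trailing)
}
```

The trailing group would be emitted after both columns. When the two columns end at roughly the same y, no block clears the threshold and the partition leaves the column untouched.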
…span-level split

sidebar_body_reading_order clustered spans into baseline LINES first, but a
publisher-metadata sidebar (MDPI/Frontiers/PMC) shares baselines with the body
(the narrow left metadata column interleaves with body lines by Y), so each
line fused sidebar+body into one full-width band and the sidebar never
separated. These PDFs also carry NO background tint to anchor the sidebar
(confirmed: zero drawings behind it), so geometry alone is indistinguishable
from a label:value form.

Classify per SPAN by the gutter instead of per line, and gate the reorder on a
semantic anti-form discriminator: the sidebar column must carry >=2 distinct
publisher-furniture labels (Citation/Received/Accepted/Published/Copyright/
Licensee/Academic Editor/Creative Commons/…) — furniture that never heads a
form field or body column. Emit body (+ any true full-width band) top-to-bottom,
then the sidebar last.

PMC8103263 extract_text: body now reads contiguously (title→abstract→intro→body)
with all metadata furniture at the end (was interleaved baseline-by-baseline).
156-PDF byte-sweep: ZERO changes — the >=2-label gate fires on no corpus PDF, so
forms (SF1199A), N-up spreads, and narrow-column pages are byte-identical.
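
The anti-form discriminator amounts to counting distinct furniture labels in the sidebar column. The sketch below uses only the labels named in this commit message (the original list is elided after "Creative Commons") and assumes a simple prefix match, which may not be how the crate matches them.

```rust
use std::collections::HashSet;

/// Labels named in the commit message; the real list is longer.
const FURNITURE_LABELS: [&str; 8] = [
    "Citation",
    "Received",
    "Accepted",
    "Published",
    "Copyright",
    "Licensee",
    "Academic Editor",
    "Creative Commons",
];

/// Count how many DISTINCT furniture labels head a sidebar span.
pub fn distinct_furniture_labels(sidebar_texts: &[&str]) -> usize {
    let mut seen: HashSet<&'static str> = HashSet::new();
    for text in sidebar_texts {
        let text = text.trim_start();
        for label in FURNITURE_LABELS {
            // Assumption: a label counts when the span text starts with it.
            if text.starts_with(label) {
                seen.insert(label);
            }
        }
    }
    seen.len()
}

/// The reorder only engages with at least two distinct labels.
pub fn sidebar_gate_passes(sidebar_texts: &[&str]) -> bool {
    distinct_furniture_labels(sidebar_texts) >= 2
}
```

Counting distinct labels rather than occurrences means a column that repeats one label many times still fails the gate.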
…arkdown + html

The plain-text path orders via extract_spans (which applies
sidebar_body_reading_order), but to_markdown_inner / to_html_inner re-derive
reading order through the pipeline afterward, re-interleaving the metadata
column. Apply the same sidebar gate on those paths and preserve its order
(via preserve_input_order) on untagged pages, mirroring the two-column-prose
handling. A trustworthy struct tree's mcid order still wins.

PMC8103263 markdown/html: metadata sidebar (Citation/Received/Copyright/…) now
reads after the body, matching the plain-text path. 156-PDF byte-sweep: zero
changes (the >=2-furniture-label gate fires on no corpus PDF).
…dChars Arabic

Producers that draw RTL glyphs individually under /ReversedChars (ISO 32000-1
§14.8.2.3.3) mark real word boundaries with EXPLICIT space glyphs and never
encode them as inter-glyph gaps (confirmed: arabic-structured + wiki-cat-ar
emit zero literal-space Tj strings; spaces are F5 <0003> glyphs). oxide's
geometric space detector then inserted spurious spaces between cursively
adjacent Arabic letters, shattering words (إسبريسو -> إس بر يسو).

Track a per-page saw_reversed_chars flag (set on /ReversedChars BMC) and, on
such pages, suppress a GEOMETRIC space between two Arabic letters — explicit
space glyphs (whitespace-only spans) still segment words. Gated to ReversedChars
pages so ordinary geometric-spaced Arabic producers are untouched.

arabic-structured: إسبريسو/كابتشينو/قهوة مقطّرة/والمناطق now whole; text CER
0.092->0.073, wJacc 0.488->0.657. wiki-cat-ar held at CER 0.004 / wJacc 0.975.
156-PDF byte-sweep: zero changes (gate fires on no corpus PDF).
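
The suppression rule reduces to a small predicate. This is a sketch under assumptions: `keep_geometric_space` is a hypothetical name, and "Arabic letter" is approximated as an alphabetic code point in the basic Arabic block (U+0600..=U+06FF).

```rust
/// Approximation: alphabetic code points in the basic Arabic block.
fn is_arabic_letter(c: char) -> bool {
    ('\u{0600}'..='\u{06FF}').contains(&c) && c.is_alphabetic()
}

/// Decide whether a geometrically detected gap should become a space.
/// On a page where /ReversedChars was seen, a gap between two Arabic
/// letters is suppressed; explicit space glyphs are handled elsewhere
/// and are not affected by this check.
pub fn keep_geometric_space(prev_last: char, next_first: char, saw_reversed_chars: bool) -> bool {
    !(saw_reversed_chars && is_arabic_letter(prev_last) && is_arabic_letter(next_first))
}
```

Pages without the flag, and gaps next to digits, punctuation, or Latin text, keep the geometric detector's decision.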
…l_order

Records why the markdown/html RTL path mis-orders an interleaved zero-width
glyph (sorts whole spans by x, cannot place a glyph intra-word) and points to
the glyph-level text-path reconstruction as the fix, so a future dev finds the
root cause without re-deriving it. Comment-only; no behavior change.
…Align

CCITT G4 fax images rendered blank because (1) /EncodedByteAlign (common in
fax scanners) was parsed but never applied — the third-party fax crate has no
such hook and its bit reader is private — so byte-padded rows decoded to
garbage, and (2) on any decode failure the code returned an all-white buffer
(mean=255, std=0) reported as success, masking the failure as a blank page.

Replace the fax-crate image path with an in-house ITU-T T.6 (Group 4) decoder
(src/decoders/ccitt.rs): MSB-first bit reader, the verified 2D mode + Modified-
Huffman run tables (generated from T.4, asserted prefix-free), the reference-
line changing-element walk ported from the verified fax-crate logic, plus the
two gaps the crate could not express:
- EncodedByteAlign: skip zero fill to the next byte boundary between rows
  (pdfium-guarded — a 1 in the fill region disables alignment rather than
  corrupting every subsequent row).
- Partial recovery: keep the rows decoded before a truncation/damage and
  white-pad the tail, instead of discarding the whole page. Zero decodable
  rows returns Err.
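
The byte-alignment step can be sketched with a small MSB-first bit reader. This is an illustration, not the decoder in src/decoders/ccitt.rs: the `BitReader` type and its method names are assumptions, and the guard is simplified to a per-row return value rather than a decoder-wide switch.

```rust
/// Minimal MSB-first bit reader over a byte slice.
pub struct BitReader<'a> {
    data: &'a [u8],
    /// Absolute bit position from the start of `data`.
    pos: usize,
}

impl<'a> BitReader<'a> {
    pub fn new(data: &'a [u8]) -> Self {
        BitReader { data, pos: 0 }
    }

    pub fn bit_pos(&self) -> usize {
        self.pos
    }

    /// Read one bit, most significant bit of each byte first.
    pub fn read_bit(&mut self) -> Option<u8> {
        let byte = *self.data.get(self.pos / 8)?;
        let bit = (byte >> (7 - (self.pos % 8))) & 1;
        self.pos += 1;
        Some(bit)
    }

    /// Skip zero fill up to the next byte boundary at the end of a row.
    /// Returns false, leaving the position unchanged, when a 1 bit sits in
    /// the fill region, so the caller can stop treating the stream as aligned.
    pub fn align_to_byte(&mut self) -> bool {
        let rem = self.pos % 8;
        if rem == 0 {
            return true;
        }
        let byte = match self.data.get(self.pos / 8) {
            Some(b) => *b,
            None => return true,
        };
        let fill_mask = 0xFFu8 >> rem; // the bits not yet consumed in this byte
        if byte & fill_mask != 0 {
            return false;
        }
        self.pos += 8 - rem;
        true
    }
}
```

A row decoder would call `align_to_byte` after each row when EncodedByteAlign is set, and fall back to unaligned decoding the first time it returns false.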

decompress_ccitt routes to the in-house decoder first, falls back to the fax
crate (Group 3, or G4 streams it declines), and only as a last resort emits a
LOUDLY-WARNED blank — never a silent one.

Validated against a pymupdf/pdfium oracle: byte-aligned G4 (with /Rows, no
/Rows, tall pages, /BlackIs1) now decodes pixel-identical to the oracle (was
corrupt/blank); a truncated stream recovers partial content (was 255/0 blank);
real CCITT fixtures (issue_395 9pp, 538250) decode byte-identical to the prior
fax path (no regression, no crash). Unit tests cover V0/Horizontal/byte-align/
zero-rows + table prefix-freeness.
The RW-1e clip (ISO 32000-1 §8.10.1) dropped text a Form XObject paints
outside its /BBox. That is correct for a figure whose embedded source PDF
retained a draft-galley underlay, but wrong for a full-page content-frame
wrapper whose declared BBox happens to exclude real body text: a conformant
renderer clips it, yet every text extractor (poppler/pdftotext, the common
reference) keeps it — and for a wrapper it is the body's only copy.

Discriminate by coverage: a figure occupies a sub-region of the page; a
wrapper covers most of it. Only apply the clip when the mapped clip rect is
figure-sized (< 60% of page MediaBox area). Measured figures cover <=27% of
the page; the regressing wrapper covered 82%. The galley-dedup win is kept
(figure forms still clip) while page-wrapper bodies are preserved.
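
The coverage discriminator is a single area ratio. A minimal sketch, with a hypothetical `Rect` type and function name; only the 60% threshold comes from the commit message.

```rust
#[derive(Debug, Clone, Copy)]
pub struct Rect {
    pub width: f32,
    pub height: f32,
}

impl Rect {
    fn area(self) -> f32 {
        self.width.max(0.0) * self.height.max(0.0)
    }
}

/// Threshold from the commit message: clip only below 60% page coverage.
const FIGURE_MAX_COVERAGE: f32 = 0.60;

/// True when the mapped /BBox clip rect is figure-sized relative to the
/// page MediaBox, i.e. the clip should be applied to extracted text.
pub fn clip_is_figure_sized(clip: Rect, media_box: Rect) -> bool {
    let page_area = media_box.area();
    if page_area <= 0.0 {
        return false; // degenerate page: keep all text
    }
    clip.area() / page_area < FIGURE_MAX_COVERAGE
}
```

On a 612x792 page, a 300x400 form covers about 25% and is clipped, while a 560x710 wrapper covers about 82% and keeps its body text.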

Found in the v0.3.64->v0.3.65 release regression sweep (301 PDFs): a
full-page-wrapper page lost its body (2726->115 bytes). Post-fix the page is
byte-identical to v0.3.64 and the dedup fixture is unchanged; the rest of the
301-PDF sweep is byte-identical to the pre-fix candidate (zero collateral).
CHANGELOG 0.3.65 (2026-06-16): multilingual & layout extraction quality —
RTL Arabic/Hebrew bidi reconstruction, multi-region reading order (publisher
sidebars, two-column academic), CJK/Indic word segmentation — plus an in-house
CCITT Group 4 fax decoder (#738), two-column structured surfacing (#734),
reading-order threads (#458), and O(n^2) hot-path removals.

Bumps 0.3.64 -> 0.3.65 in every binding manifest/version/parity test
(Cargo workspace cli/mcp/jni, Cargo.lock, pyproject already at 0.3.65,
js/wasm package.json, java pom, ruby, php, csharp, go, python/rust parity
tests) and the README Maven example.
@yfedoseev yfedoseev requested a review from Copilot June 16, 2026 03:55

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Release v0.3.65 focuses on improving multilingual extraction quality (RTL bidi, CJK/Indic segmentation), layout-aware reading order (two-column + sidebars + threads), and performance, alongside a new in-house CCITT Group 4 decoder and version parity bumps across bindings.

Changes:

  • Adds/extends extraction features: structured two-column regions (column_index), tagged structure surfacing (Lbl → role, Sect/Art/Part → section_id), article-thread reading order, and Form /BBox clipping.
  • Improves RTL and word segmentation: digit-preserving RTL reversal, Arabic joining-type handling, Hangul mid-word wrap rejoin, Indic punctuation hugging.
  • Removes hot-path O(n²) patterns and bumps versions/docs/tests to 0.3.65.

Reviewed changes

Copilot reviewed 53 out of 55 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| wasm-pkg/package.json | Bumps wasm package version to 0.3.65. |
| tests/v0365_targets_and_locks.rs | Adds regression targets/locks for reading order + RTL behavior. |
| tests/seg_ko_functional.rs | Adds end-to-end Korean segmentation regression tests. |
| tests/seg_indic_functional.rs | Adds end-to-end Indic punctuation hugging regression tests. |
| tests/rw1e_form_bbox_clip.rs | Adds regression tests for Form XObject /BBox clipping in extraction. |
| tests/issue_734_two_column_text_order.rs | Adds integration tests for two-column reading order + structured regions. |
| tests/issue_734_tagged_structure.rs | Adds tests for tagged structure surfacing (Lbl, section continuity). |
| tests/issue_458_article_threads.rs | Adds tests for parsing/applying /Threads bead ordering. |
| tests/full_width_header_columns_md.rs | Updates prior ignore/commentary to reflect #734 fix being in place. |
| tests/core_parity.rs | Updates version-parity assertion to 0.3.65. |
| src/text/rtl_detector.rs | Adds Arabic right-joining letter classifier + tests. |
| src/text/bidi.rs | Adds digit-preserving RTL reversal helper + tests. |
| src/structured.rs | Adds section_id, column mode override API, and improved gutter detection/grouping. |
| src/structure/traversal.rs | Threads section_id through structure traversal output. |
| src/pipeline/reading_order/xycut.rs | Replaces O(k²) clustering logic with sort + window counting. |
| src/pipeline/reading_order/mod.rs | Adds preserve_input_order to avoid re-sorting already-reordered spans. |
| src/pipeline/page_order.rs | Wires article-thread bead ordering into default reading-order path with gates. |
| src/pipeline/mod.rs | Implements preserve-input-order fast path in the text pipeline. |
| src/lib.rs | Exports ColumnMode in the public API. |
| src/extractors/text.rs | Improves RTL handling, fixes Tw application for multibyte codes, adds Form /BBox clip logic. |
| src/extractors/ccitt_bilevel.rs | Uses in-house CCITT G4 decoder with fallback + changes bilevel helper visibility. |
| src/document.rs | Adds two-column prose reorder, sidebar-body reorder, RTL fixes, segmentation fixes, and perf wins. |
| src/decoders/mod.rs | Exposes ccitt module internally for decoder use. |
| src/decoders/ccitt.rs | Replaces pass-through-only CCITT module docs with in-house G4 decoder + tests. |
| src/converters/xlsx_layout.rs | Removes O(n²) hyphen-merge by using single-pass accumulator. |
| src/converters/pptx_layout.rs | Removes O(n²) hyphen-merge by using single-pass accumulator. |
| src/converters/pdf_to_ir.rs | Avoids quadratic rotated-char filtering when no rotated chars exist. |
| src/converters/docx_layout.rs | Removes O(n²) hyphen-merge by using single-pass accumulator. |
| ruby/spec/core_parity_spec.rb | Updates Ruby binding version parity check to 0.3.65. |
| ruby/spec/cdylib_smoke_spec.rb | Updates Ruby cdylib smoke version check to 0.3.65. |
| ruby/lib/pdf_oxide/version.rb | Bumps Ruby gem version constant to 0.3.65. |
| python/tests/test_core_parity.py | Updates Python binding version parity check to 0.3.65. |
| pyproject.toml | Bumps Python package version to 0.3.65. |
| php/tests/Integration/CoreParityTest.php | Updates PHP binding version parity check to 0.3.65. |
| php/src/Pdf.php | Bumps PHP version constant to 0.3.65. |
| php/scripts/download-native-lib.php | Bumps PHP installer defaults/user-agent to v0.3.65. |
| pdf_oxide_mcp/src/protocol.rs | Adds MCP schema option column_mode for structured extraction. |
| pdf_oxide_mcp/src/extract.rs | Implements MCP column_mode validation and passes it to structured extraction. |
| pdf_oxide_mcp/Cargo.toml | Bumps MCP crate + dependency version to 0.3.65. |
| pdf_oxide_jni/Cargo.toml | Bumps JNI crate + dependency version to 0.3.65. |
| pdf_oxide_cli/tests/structured_format.rs | Adds CLI tests for --column-mode override semantics. |
| pdf_oxide_cli/src/cli/repl.rs | Plumbs default column_mode=auto through REPL text command. |
| pdf_oxide_cli/src/cli/mod.rs | Plumbs column_mode through CLI dispatch and stdin path. |
| pdf_oxide_cli/src/cli/commands/text.rs | Implements --column-mode mapping for structured extraction. |
| pdf_oxide_cli/src/cli/args.rs | Adds --column-mode CLI option for structured format. |
| pdf_oxide_cli/Cargo.toml | Bumps CLI crate + dependency version to 0.3.65. |
| js/package.json | Bumps JS package version to 0.3.65. |
| java/pom.xml | Bumps Java binding version/tag to 0.3.65. |
| go/cmd/install/main.go | Bumps Go installer fallback version to 0.3.65. |
| csharp/PdfOxide/PdfOxide.csproj | Bumps NuGet package version to 0.3.65. |
| README.md | Updates Maven dependency snippet version to 0.3.65. |
| Cargo.toml | Bumps core crate version to 0.3.65. |
| CHANGELOG.md | Adds v0.3.65 changelog entry detailing features/fixes/perf. |

Comments suppressed due to low confidence (6)

src/extractors/ccitt_bilevel.rs:1

  • The blank-image fallback size is now derived from params.rows, but params.rows can be None even when the caller provided height_opt (e.g., image dictionary has Height but DecodeParms omits Rows). This can under-allocate the output (often to 1 row), producing an invalid/incorrect image buffer. Use height_opt as the primary fallback height (as before), and only fall back to params.rows (or vice versa) in a way that preserves the known image height.

src/pipeline/page_order.rs:1

  • parse_article_threads(doc) is invoked inside page_article_bead_rects, which is called per page in the default reading-order path. If thread parsing walks a significant portion of the document structure, this can become an avoidable O(pages × parse_cost) overhead. Consider caching parsed threads at the PdfDocument level (or in a per-document lazy cache) and reusing them across pages, then filtering to the current page's beads.

src/pipeline/page_order.rs:1

  • The order-divergence gate compares only x and y when checking whether bead order matches geometric order. If two beads share the same (x, y) but differ in width/height (or if floating-point representation differs slightly), the gate can incorrectly treat distinct beads as equal and suppress thread application. Compare the full rectangle identity (x, y, width, height) with exact or tolerant comparison, or use a stable bead identifier if available.

src/document.rs:1 (raised three times with identical text)

  • Switching prev_span from Option<&TextSpan> to Option<TextSpan> forces a full TextSpan clone for every emitted span. TextSpan can carry sizable allocations (e.g., text, char_widths), so this can materially increase CPU and memory pressure on large pages. If the borrow-checker constraint is the need to persist prev_span across MCID iterations where spans may come from a temporary vector, consider storing only the fields needed for spacing/line-break decisions (bbox, font_size, and a small amount of text-derived info), or cloning a lightweight struct rather than the full span.

The v0.3.65 batch landed without a full local CI pass (shared-box build
contention); this clears every failure the PR surfaced.

- rustfmt: format ccitt.rs, document.rs, ccitt_bilevel.rs, text.rs,
  rtl_detector.rs, and the v0365 lock test (committed unformatted).
- clippy (-D warnings): manual_range_contains, manual_clamp,
  useless_conversion, useless_vec (document.rs); manual_is_multiple_of
  (ccitt.rs); doc_lazy_continuation x2 (structured.rs); unnecessary_map_or
  -> is_some_and (text.rs RW-1e gate).
- rw1_full_width_title_reads_contiguously: the D3 sidebar-segregation
  rewrite added an anti-false-positive gate requiring >=2 DISTINCT furniture
  labels, but the fixture's sidebar was 8x 'Citation' (one distinct label)
  and the title words were spread with gaps that split at the gutter, so
  sidebar_body_reading_order never engaged. Make the fixture realistic: a
  single full-width title run plus eight distinct furniture labels
  (Citation/Received/Accepted/...), keeping the span count above the
  classifier's 30-span floor. Locks the real behaviour; zero production
  code change.
cargo doc --no-deps -D warnings (stable) rejected two intra-doc links from
public items to non-public targets:
- CcittFaxDecoder doc linked [`decode`] (a pub(crate) fn) — private-intra-doc-links
- ColumnMode doc linked [`build_structured_page_with_mode`] (pub(crate) fn) —
  broken-intra-doc-links (out of public doc scope)
Demote both to plain code spans; they were descriptive, not navigational.
@yfedoseev yfedoseev merged commit fd16e75 into main Jun 16, 2026
241 checks passed
@yfedoseev yfedoseev deleted the release/v0.3.65 branch June 16, 2026 07:43