release: v0.3.65 — multilingual & layout extraction quality#739
Merged
…cing; #458 threads
Corpus-validated (v0.3.64 vs HEAD, 156-PDF sweep). Ships the parts that are spec-correct and byte-identical-safe; the automatic text/md/html column-major reorder is REVERTED. The sweep proved the per-flow heuristics regress real two-column PDFs (arxiv references scrambled, a road data-sheet form glued, a TOC's page numbers detached), the same wall the in-partition band-peel hit 3×. text/md/html output is byte-identical to v0.3.64 (verified on the three regressors). The correct fix is a single recursive XY-cut column-region pass (`page_reading_order`) consumed by ALL flows; tracked as a follow-up and gated on the corpus (KJF text/md/html targets are #[ignore]d until it lands).
Shipped (corpus-safe / additive):
* structured `extract_structured`: populate `column_index` for two-column bodies (gutter-bridge-aware detector) and group one region per column in column-major order (#734 §1/§2); best-effort, structured-only.
* tagged structure surfacing (§14.8.4): `Lbl` → MarginalLabel (#734 §4); nearest `Sect`/`Art`/`Part` → document-stable `section_id` on each region, giving cross-page chapter continuity (#734 §5) and spillover grouping (#734 §6). Additive: empty for untagged/suspect PDFs, so their output is unchanged. traversal.rs tracks the nearest Sect/Art/Part ancestor per MCID.
* ColumnMode {Auto,Two,Single} via MCP `column_mode` + CLI `--column-mode`.
* #458 article threads: page_article_bead_rects gate (≥2 beads ∧ ≥80% coverage ∧ ≥2 x-bands ∧ order-divergence) wired into page_reading_order, gated on !has_structure_tree; order-divergence keeps single-column threads identical.
Reverted (corpus-regressing): per-flow column-major in assemble_text_from_spans, to_markdown/to_html (split-convert), and the gutter/cmp/split helpers.
Local guard battery + corpus diff: text/md/html byte-identical to v0.3.64 on the 3 regressors; structured/tagged/threads tests green.
…i, reading order, word-seg
Corpus-verified 0-regression (156-PDF byte-sweep + per-doc CER vs pdf_benches golds):
- RTL Arabic/Hebrew: number-run preservation on visual->logical reversal (UAX#9 L2, reverse_rtl_keep_numbers); right-joining joining-type discriminator for interior-space strip; density-gated de-shatter of glyph-exploded words; and the cross-span GLYPH INTERLEAVE repair (merge_interleaved_rtl_lines): producer-drawn zero-width mark/consonant spans whose x falls inside a body span are re-collapsed per visual line using the producer's standalone-space word boundaries, fixing al-thadyiyaat-class scrambles. wiki-cat-ar CER 0.134 -> 0.079; gated so already-correct RTL (BidiSample, ArabicCIDTrueType, hebrew_mirrored: zero width-0 spans) is byte-identical.
- RW-1 multi-region reading order: sidebar_body_reading_order band-first emit for narrow-sidebar+wide-body pages (PMC title no longer shattered along the body gutter).
- Two-column reading order (attempt #5), Korean over-seg, Bengali/Hindi clause-punct hug, CID-Tw single-byte gating (#9.3.3).
Hand-built in-code fixtures (no third-party PDFs); RTL/reading-order guard battery green.
… word-boundary fixes
RW-1e (real-academic): honor the ISO 32000-1 §8.10.1 Form XObject /BBox clip in
process_xobject — drop spans the form paints OUTSIDE its /BBox. pdfTeX figure
forms embed a 'FOR PEER REVIEW' draft-galley page outside the figure BBox that a
conformant renderer clips; pdf_oxide was emitting it as duplicate body text.
PMC8103263 CER 0.510->0.155 (pymupdf parity), dup_ratio 0.254->0.025. Allocation-
free fast path keeps form-heavy docs fast. Matches pymupdf's verified behaviour.
SEG-HE: number-preserving RTL visual->logical reversal of a neutral+single-digit
span in a pure-RTL run (' ,2009-' -> '-2009, '); wiki-cat-he wJacc 0.873->1.000.
SEG-AR: a space bordering non-cursive clause punctuation is a real word break, not
a cursive shatter space; wiki-cat-ar wJacc 0.574->0.600.
156-PDF byte-sweep: all RTL fixtures (BidiSample/ArabicCIDTrueType/hebrew_mirrored/
PDFBOX-4531/issue10301/issue18117) byte-identical; only deterministic corpus change
is one benign same-length arxiv reorder. Hand-built in-code fixtures.
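The SEG-HE reversal above can be illustrated with a minimal sketch. It assumes a pure-RTL run stored in visual order and treats any `char::is_numeric` run as a number; `reverse_keep_digit_runs` is a hypothetical name, not the crate's `reverse_rtl_keep_numbers`, and the real helper may handle more cases (separators inside numbers, mixed scripts).

```rust
/// Reverse a visual-order RTL run into logical order while keeping every
/// digit run in its original left-to-right order.
/// Hypothetical helper for illustration only.
fn reverse_keep_digit_runs(visual: &str) -> String {
    // Step 1: plain character reversal (this also flips the digit runs).
    let reversed: Vec<char> = visual.chars().rev().collect();
    let mut out = String::with_capacity(visual.len());
    let mut i = 0;
    while i < reversed.len() {
        if reversed[i].is_numeric() {
            // Step 2: find the whole digit run and flip it back.
            let start = i;
            while i < reversed.len() && reversed[i].is_numeric() {
                i += 1;
            }
            out.extend(reversed[start..i].iter().rev());
        } else {
            out.push(reversed[i]);
            i += 1;
        }
    }
    out
}
```

With the commit's own example, `reverse_keep_digit_runs(" ,2009-")` yields `"-2009, "`: the punctuation swaps sides while the year stays readable.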
… pages span_overlaps_rotated_chars drops a span only when its nearest char is rotated (>=5deg); on a page with NO rotated char (the overwhelming majority) the per-span nearest-char scan ran O(spans*chars) only to never drop anything. Gate the whole retain behind one O(chars) precheck — byte-identical output, removes the quadratic on every unrotated page of the PDF->IR (docx/pptx/xlsx/round-trip) path.
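A rough sketch of the precheck described above, under simplified assumptions: `Ch`, `Span`, the point-distance metric, and the function names are all hypothetical stand-ins, and only the 5-degree threshold comes from the commit message. The point is that one linear scan over the chars can prove the quadratic filter is a no-op.

```rust
#[derive(Clone, Copy)]
struct Ch { x: f32, y: f32, rotation_deg: f32 }

#[derive(Clone, Copy, Debug, PartialEq)]
struct Span { x: f32, y: f32 }

const ROTATED_MIN_DEG: f32 = 5.0; // threshold quoted in the commit message

fn dist2(s: &Span, c: &Ch) -> f32 {
    (s.x - c.x).powi(2) + (s.y - c.y).powi(2)
}

/// O(chars) per span: is the char closest to this span rotated?
fn nearest_char_is_rotated(span: &Span, chars: &[Ch]) -> bool {
    chars
        .iter()
        .min_by(|a, b| dist2(span, a).total_cmp(&dist2(span, b)))
        .is_some_and(|c| c.rotation_deg.abs() >= ROTATED_MIN_DEG)
}

fn drop_spans_over_rotated_chars(spans: &mut Vec<Span>, chars: &[Ch]) {
    // O(chars) precheck: with no rotated char the retain below can never
    // drop a span, so skip the O(spans * chars) scan entirely.
    if !chars.iter().any(|c| c.rotation_deg.abs() >= ROTATED_MIN_DEG) {
        return;
    }
    spans.retain(|s| !nearest_char_is_rotated(s, chars));
}
```

Because the early return only fires when the filter would have kept every span anyway, output is unchanged on both paths.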
…rge, word extraction
All byte-identical (the 156-PDF byte-sweep shows ZERO output change vs the prior binary; only the pre-existing RW-1e arxiv reorder differs).
- PERF-1: untagged table-ownership filter (document.rs). The per-span containment test was O(spans*cells) (~2e7 AABB tests on dense table pages). Index cell bboxes into coarse y-bands once; a span only probes cells in its y-band. A containing cell always shares the span's y-band, so the result is identical to the full scan.
- PERF-6: is_single_column_region (xycut.rs). Two O(k^2) within-tolerance cluster scans over per-region lines/gaps replaced by sort + partition_point binary search (O(k log k)); max-count / any-cluster are multiset properties.
- PERF-5: merge_hyphenated_spans (docx/pptx/xlsx). Vec::remove(i+1) in a loop with no-advance-on-merge was O(n^2); rewritten as one forward pass with a running accumulator that chains exactly as before.
- PERF-7: extract_text_as_words / _with_custom_gaps (document.rs). to_chars() was materialized twice per span; materialize once and reuse.
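To make the PERF-6 idea concrete, here is a minimal sketch of replacing a pairwise within-tolerance count with a sort plus `partition_point`. The function name and the exact cluster definition (values within `tol` of a centre value) are assumptions for illustration; the actual `is_single_column_region` logic is not shown in this PR text.

```rust
/// Largest number of values lying within `tol` of any single value.
/// The naive version compares every pair (O(k^2)); sorting once and
/// binary-searching both window edges gives O(k log k).
fn max_cluster_count(values: &[f32], tol: f32) -> usize {
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mut best = 0;
    for &centre in &sorted {
        // First index with value >= centre - tol.
        let start = sorted.partition_point(|&v| v < centre - tol);
        // First index with value > centre + tol.
        let end = sorted.partition_point(|&v| v <= centre + tol);
        best = best.max(end - start);
    }
    best
}
```

The count depends only on which values are present, not on their input order, which is why sorting first does not change the answer.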
…order
reorder_rowspan_labels promoted a numbered reference/figure-legend
marker column ('1.', '2.', ...) as if it were a multi-row rowspan
label, hoisting the markers out of reading order. Detect a vertical
numbered list (>=3 markers in a tight left-edge cluster spread over
>=3 rows) and exclude those markers from label promotion, keeping the
legend adjacent to its figure title instead of scrambled into the body.
Corpus: 155/156 byte-identical; the one change is a pure reorder (zero
content loss) of a scanned figure-legend, keeping it with its caption.
TDD: test_rowspan_skips_numbered_reference_continuation (and the
existing genuine-rowspan promotion test still passes).
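The numbered-list exclusion can be sketched as follows. `Cell`, `x_tol`, and both function names are hypothetical; only the thresholds (at least 3 markers in a tight left-edge cluster spread over at least 3 rows) come from the commit message, and the real detector may recognise more marker shapes than `digits + "."`.

```rust
struct Cell<'a> {
    text: &'a str,
    left: f32,  // left edge of the cell's text
    row: usize, // table row index
}

/// "1.", "12." etc.: one or more ASCII digits followed by a single dot.
fn is_numbered_marker(text: &str) -> bool {
    match text.trim().strip_suffix('.') {
        Some(digits) => !digits.is_empty() && digits.bytes().all(|b| b.is_ascii_digit()),
        None => false,
    }
}

/// True when some left-edge cluster of markers covers >= 3 distinct rows
/// (which also implies >= 3 markers in that cluster).
fn looks_like_numbered_list(cells: &[Cell<'_>], x_tol: f32) -> bool {
    let markers: Vec<&Cell> = cells.iter().filter(|c| is_numbered_marker(c.text)).collect();
    markers.iter().any(|anchor| {
        let mut rows: Vec<usize> = markers
            .iter()
            .filter(|m| (m.left - anchor.left).abs() <= x_tol)
            .map(|m| m.row)
            .collect();
        rows.sort_unstable();
        rows.dedup();
        rows.len() >= 3
    })
}
```

A caller would skip rowspan-label promotion for the marker cells whenever this returns true, leaving genuine multi-row labels untouched.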
…search select_drop_cap_initials rescanned every span for each oversized initial to find the nearest body continuation (O(initials*spans)). Pre-sort span indices by left edge once and probe only the narrow candidate x-window [init_right - max_fs*0.5, init_right + max_fs*0.12] via partition_point; the exact per-candidate gap test is unchanged, so output is byte-identical. Completes the O(n^2) hotspot sweep (PERF-2, the drop-cap item deferred from 3352b2a). Corpus: 156/156 byte-identical on the regression sweep.
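A small sketch of the windowed probe described above, assuming spans are reduced to `(left_edge, original_index)` pairs and that `max_fs` is non-negative. The helper names are hypothetical; the window bounds `init_right - max_fs*0.5` and `init_right + max_fs*0.12` are the ones quoted in the commit message.

```rust
/// Pair each left edge with its original index and sort by left edge (done once).
fn sort_by_left(lefts: &[f32]) -> Vec<(f32, usize)> {
    let mut v: Vec<(f32, usize)> = lefts.iter().enumerate().map(|(i, &x)| (x, i)).collect();
    v.sort_by(|a, b| a.0.total_cmp(&b.0));
    v
}

/// Spans whose left edge falls in the drop-cap continuation window.
/// Two binary searches replace a full rescan of every span.
fn candidates_in_window(sorted: &[(f32, usize)], init_right: f32, max_fs: f32) -> &[(f32, usize)] {
    let lo = init_right - max_fs * 0.5;
    let hi = init_right + max_fs * 0.12;
    let start = sorted.partition_point(|&(x, _)| x < lo);
    let end = sorted.partition_point(|&(x, _)| x <= hi);
    &sorted[start..end]
}
```

The exact per-candidate gap test would then run only on the returned slice, so any span the full scan would have accepted is still examined.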
…RTL Arabic Arabic producers that emit glyphs via /ReversedChars + per-glyph /ActualText (ISO 32000-1 §14.8.2.3.3 / §14.9.4) reposition glyphs out of advance-order. When should_merge glued adjacent Arabic glyph spans, to_chars() reconstructed each glyph by cumulative font advance from the span's left edge, discarding the producer's true positions: e.g. lam/alif of القهوة landed at advance-x 539/542 instead of their true 548/552, so the zero-width qaf (true x 543) sorted between them and the RTL visual-order pass emitted قالهوة. After merging, stretch the advance leading into the merged-in span's first glyph so to_chars() reconstructs it at span.bbox.x, keeping per-glyph positions truthful for merge_rtl_line_to_visual_span's ascending-x sort. Gated to Arabic so Latin/CJK stay byte-identical. Result: القهوة/استهلاكًا/شائعة now correct (matches pdfium, the reference that gets this fixture perfect; pymupdf scrambles it). wiki-cat-ar text CER 0.079->0.004, wJacc 0.600->0.975; arabic-structured body de-scrambled (CER 0.092). 156-PDF byte-sweep: zero changes (all 6 RTL fixtures byte-identical) — the adjust is ~0 whenever advance already equals the true gap, so only genuinely scrambled producers are touched.
…r columns reorder_column_major_with_bands buffered a bottom-left References block (below both columns) into the left-column partition, so it printed BEFORE the entire right column instead of last. Peel any block lying a full line-height below the opposite column's bottom into a trailing group emitted after both columns at its own y. Guarded: only fires when the opposite column has real content (>=2 spans) and the block clears its bottom by a line, so balanced 2-col bodies (columns ending at ~equal y) are byte-identical. academic-2col: References now reads LAST (was mid-document). 156-PDF byte-sweep: zero changes (arxiv/SF1199A/issue1905/KJF byte-identical).
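The peel rule might look roughly like this. The sketch assumes y grows downward, collapses each block to a `top`/`bottom` pair plus a span count, and uses hypothetical names throughout; only the two guards (opposite column has at least 2 spans, block clears its bottom by a full line-height) are taken from the commit message.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Block {
    top: f32,
    bottom: f32,
    spans: usize,
}

/// Split one column into (kept, trailing). Blocks starting at least one
/// line-height below the opposite column's bottom are peeled off so they
/// can be emitted after both columns. Assumes y grows downward.
fn peel_trailing(column: Vec<Block>, opposite: &[Block], line_height: f32) -> (Vec<Block>, Vec<Block>) {
    // Guard: the opposite column must carry real content.
    let opposite_spans: usize = opposite.iter().map(|b| b.spans).sum();
    if opposite_spans < 2 {
        return (column, Vec::new());
    }
    let opposite_bottom = opposite.iter().map(|b| b.bottom).fold(f32::MIN, f32::max);
    // `partition` keeps blocks matching the predicate in the first Vec.
    column
        .into_iter()
        .partition(|b| b.top < opposite_bottom + line_height)
}
```

For balanced columns that end at roughly the same y, no block clears the threshold, so the trailing group is empty and the original order is kept.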
…span-level split sidebar_body_reading_order clustered spans into baseline LINES first, but a publisher-metadata sidebar (MDPI/Frontiers/PMC) shares baselines with the body (the narrow left metadata column interleaves with body lines by Y), so each line fused sidebar+body into one full-width band and the sidebar never separated. These PDFs also carry NO background tint to anchor the sidebar (confirmed: zero drawings behind it), so geometry alone is indistinguishable from a label:value form. Classify per SPAN by the gutter instead of per line, and gate the reorder on a semantic anti-form discriminator: the sidebar column must carry >=2 distinct publisher-furniture labels (Citation/Received/Accepted/Published/Copyright/ Licensee/Academic Editor/Creative Commons/…) — furniture that never heads a form field or body column. Emit body (+ any true full-width band) top-to-bottom, then the sidebar last. PMC8103263 extract_text: body now reads contiguously (title→abstract→intro→body) with all metadata furniture at the end (was interleaved baseline-by-baseline). 156-PDF byte-sweep: ZERO changes — the >=2-label gate fires on no corpus PDF, so forms (SF1199A), N-up spreads, and narrow-column pages are byte-identical.
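The anti-form discriminator can be sketched as a distinct-label count. The label list below contains only the labels named in the commit message (which ends in an ellipsis, so the real list is presumably longer), and the prefix-match rule and function names are assumptions for illustration.

```rust
use std::collections::HashSet;

// Only the labels quoted in the commit message; the real list is longer.
const FURNITURE_LABELS: &[&str] = &[
    "Citation", "Received", "Accepted", "Published", "Copyright",
    "Licensee", "Academic Editor", "Creative Commons",
];

/// Number of DISTINCT furniture labels that start some sidebar span.
fn distinct_furniture_labels(sidebar_spans: &[&str]) -> usize {
    let mut seen: HashSet<&str> = HashSet::new();
    for text in sidebar_spans {
        for label in FURNITURE_LABELS {
            if text.trim_start().starts_with(*label) {
                seen.insert(*label);
            }
        }
    }
    seen.len()
}

/// The reorder engages only with at least two distinct labels.
fn sidebar_gate(sidebar_spans: &[&str]) -> bool {
    distinct_furniture_labels(sidebar_spans) >= 2
}
```

Counting distinct labels rather than occurrences is what keeps a column that repeats one label many times from triggering the reorder.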
…arkdown + html The plain-text path orders via extract_spans (which applies sidebar_body_reading_order), but to_markdown_inner / to_html_inner re-derive reading order through the pipeline afterward, re-interleaving the metadata column. Apply the same sidebar gate on those paths and preserve its order (via preserve_input_order) on untagged pages, mirroring the two-column-prose handling. A trustworthy struct tree's mcid order still wins. PMC8103263 markdown/html: metadata sidebar (Citation/Received/Copyright/…) now reads after the body, matching the plain-text path. 156-PDF byte-sweep: zero changes (the >=2-furniture-label gate fires on no corpus PDF).
…dChars Arabic Producers that draw RTL glyphs individually under /ReversedChars (ISO 32000-1 §14.8.2.3.3) mark real word boundaries with EXPLICIT space glyphs and never encode them as inter-glyph gaps (confirmed: arabic-structured + wiki-cat-ar emit zero literal-space Tj strings; spaces are F5 <0003> glyphs). oxide's geometric space detector then inserted spurious spaces between cursively adjacent Arabic letters, shattering words (إسبريسو -> إس بر يسو). Track a per-page saw_reversed_chars flag (set on /ReversedChars BMC) and, on such pages, suppress a GEOMETRIC space between two Arabic letters — explicit space glyphs (whitespace-only spans) still segment words. Gated to ReversedChars pages so ordinary geometric-spaced Arabic producers are untouched. arabic-structured: إسبريسو/كابتشينو/قهوة مقطّرة/والمناطق now whole; text CER 0.092->0.073, wJacc 0.488->0.657. wiki-cat-ar held at CER 0.004 / wJacc 0.975. 156-PDF byte-sweep: zero changes (gate fires on no corpus PDF).
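As a rough illustration of the gate above: the sketch below suppresses a geometric space only when the page has seen `/ReversedChars` and both neighbours are Arabic letters. The letter test is a deliberately narrow approximation (U+0620 to U+064A), and the function names and boolean parameters are hypothetical.

```rust
/// Approximate Arabic-letter test: the core letter range of the Arabic block.
/// The real classifier likely covers more (extended letters, presentation forms).
fn is_arabic_letter(c: char) -> bool {
    ('\u{0620}'..='\u{064A}').contains(&c)
}

/// Decide whether a gap that LOOKS like a space should become one.
/// Explicit space glyphs are handled elsewhere and are never suppressed here.
fn emit_geometric_space(prev: char, next: char, gap_looks_like_space: bool, saw_reversed_chars: bool) -> bool {
    if !gap_looks_like_space {
        return false;
    }
    // On /ReversedChars pages the producer marks word breaks with explicit
    // space glyphs, so a gap between two Arabic letters is not a word break.
    !(saw_reversed_chars && is_arabic_letter(prev) && is_arabic_letter(next))
}
```

Pages without the flag take the unchanged geometric path, which is why ordinary Arabic producers are unaffected.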
…l_order Records why the markdown/html RTL path mis-orders an interleaved zero-width glyph (sorts whole spans by x, cannot place a glyph intra-word) and points to the glyph-level text-path reconstruction as the fix, so a future dev finds the root cause without re-deriving it. Comment-only; no behavior change.
…Align
CCITT G4 fax images rendered blank because (1) /EncodedByteAlign (common in fax scanners) was parsed but never applied (the third-party fax crate has no such hook and its bit reader is private), so byte-padded rows decoded to garbage, and (2) on any decode failure the code returned an all-white buffer (mean=255, std=0) reported as success, masking the failure as a blank page.
Replace the fax-crate image path with an in-house ITU-T T.6 (Group 4) decoder (src/decoders/ccitt.rs): MSB-first bit reader, the verified 2D mode + Modified-Huffman run tables (generated from T.4, asserted prefix-free), the reference-line changing-element walk ported from the verified fax-crate logic, plus the two gaps the crate could not express:
- EncodedByteAlign: skip zero fill to the next byte boundary between rows (pdfium-guarded: a 1 in the fill region disables alignment rather than corrupting every subsequent row).
- Partial recovery: keep the rows decoded before a truncation/damage and white-pad the tail, instead of discarding the whole page. Zero decodable rows returns Err.
decompress_ccitt routes to the in-house decoder first, falls back to the fax crate (Group 3, or G4 streams it declines), and only as a last resort emits a LOUDLY-WARNED blank, never a silent one.
Validated against a pymupdf/pdfium oracle: byte-aligned G4 (with /Rows, no /Rows, tall pages, /BlackIs1) now decodes pixel-identical to the oracle (was corrupt/blank); a truncated stream recovers partial content (was 255/0 blank); real CCITT fixtures (issue_395 9pp, 538250) decode byte-identical to the prior fax path (no regression, no crash). Unit tests cover V0/Horizontal/byte-align/zero-rows + table prefix-freeness.
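The byte-align step can be sketched with a tiny MSB-first bit reader. This is an illustration under assumptions, not the code in src/decoders/ccitt.rs: `BitReader`, `align_to_byte`, and the choice to report a non-zero fill via a `false` return are all hypothetical.

```rust
struct BitReader<'a> {
    data: &'a [u8],
    pos: usize, // absolute bit position, bit 7 of each byte first
}

impl<'a> BitReader<'a> {
    fn new(data: &'a [u8]) -> Self {
        Self { data, pos: 0 }
    }

    /// Read one bit, most significant bit first; None at end of data.
    fn read_bit(&mut self) -> Option<u8> {
        let byte = *self.data.get(self.pos / 8)?;
        let bit = (byte >> (7 - (self.pos % 8))) & 1;
        self.pos += 1;
        Some(bit)
    }

    /// Skip zero fill bits up to the next byte boundary.
    /// Returns false (and leaves the position untouched) if the fill
    /// region contains a 1, so the caller can disable alignment.
    fn align_to_byte(&mut self) -> bool {
        let rem = self.pos % 8;
        if rem == 0 {
            return true; // already aligned
        }
        let Some(&byte) = self.data.get(self.pos / 8) else {
            return true; // past the end: nothing to skip
        };
        let fill_mask = 0xFFu8 >> rem; // the (8 - rem) unread low bits
        if byte & fill_mask != 0 {
            return false;
        }
        self.pos += 8 - rem;
        true
    }
}
```

A row decoder would call `align_to_byte` after each row when `/EncodedByteAlign` is set and stop aligning for the rest of the stream the first time it returns false.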
The RW-1e clip (ISO 32000-1 §8.10.1) dropped text a Form XObject paints outside its /BBox. That is correct for a figure whose embedded source PDF retained a draft-galley underlay, but wrong for a full-page content-frame wrapper whose declared BBox happens to exclude real body text: a conformant renderer clips it, yet every text extractor (poppler/pdftotext, the common reference) keeps it — and for a wrapper it is the body's only copy. Discriminate by coverage: a figure occupies a sub-region of the page; a wrapper covers most of it. Only apply the clip when the mapped clip rect is figure-sized (< 60% of page MediaBox area). Measured figures cover <=27% of the page; the regressing wrapper covered 82%. The galley-dedup win is kept (figure forms still clip) while page-wrapper bodies are preserved. Found in the v0.3.64->v0.3.65 release regression sweep (301 PDFs): a full-page-wrapper page lost its body (2726->115 bytes). Post-fix the page is byte-identical to v0.3.64 and the dedup fixture is unchanged; the rest of the 301-PDF sweep is byte-identical to the pre-fix candidate (zero collateral).
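The coverage discriminator reduces to an area ratio. A minimal sketch, assuming axis-aligned rectangles already mapped into page space; `Rect`, `should_apply_bbox_clip`, and the constant name are hypothetical, while the 60% threshold and the 27% / 82% measurements come from the commit message.

```rust
#[derive(Clone, Copy)]
struct Rect {
    x0: f32,
    y0: f32,
    x1: f32,
    y1: f32,
}

impl Rect {
    fn area(&self) -> f32 {
        (self.x1 - self.x0).max(0.0) * (self.y1 - self.y0).max(0.0)
    }
}

const FIGURE_MAX_COVERAGE: f32 = 0.60; // threshold from the commit message

/// Apply the Form XObject /BBox clip only when the mapped clip rectangle
/// is figure-sized relative to the page MediaBox.
fn should_apply_bbox_clip(mapped_clip: Rect, media_box: Rect) -> bool {
    let page_area = media_box.area();
    page_area > 0.0 && mapped_clip.area() / page_area < FIGURE_MAX_COVERAGE
}
```

With the measured values, a figure covering 27% of the page is clipped and the 82% wrapper is not, leaving a wide margin on both sides of the threshold.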
CHANGELOG 0.3.65 (2026-06-16): multilingual & layout extraction quality — RTL Arabic/Hebrew bidi reconstruction, multi-region reading order (publisher sidebars, two-column academic), CJK/Indic word segmentation — plus an in-house CCITT Group 4 fax decoder (#738), two-column structured surfacing (#734), reading-order threads (#458), and O(n^2) hot-path removals. Bumps 0.3.64 -> 0.3.65 in every binding manifest/version/parity test (Cargo workspace cli/mcp/jni, Cargo.lock, pyproject already at 0.3.65, js/wasm package.json, java pom, ruby, php, csharp, go, python/rust parity tests) and the README Maven example.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Release v0.3.65 focuses on improving multilingual extraction quality (RTL bidi, CJK/Indic segmentation), layout-aware reading order (two-column + sidebars + threads), and performance, alongside a new in-house CCITT Group 4 decoder and version parity bumps across bindings.
Changes:
- Adds/extends extraction features: structured two-column regions (`column_index`), tagged structure surfacing (`Lbl`→role, `Sect`/`Art`/`Part`→`section_id`), article-thread reading order, and Form `/BBox` clipping.
- Improves RTL and word segmentation: digit-preserving RTL reversal, Arabic joining-type handling, Hangul mid-word wrap rejoin, Indic punctuation hugging.
- Removes hot-path O(n²) patterns and bumps versions/docs/tests to 0.3.65.
Reviewed changes
Copilot reviewed 53 out of 55 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| wasm-pkg/package.json | Bumps wasm package version to 0.3.65. |
| tests/v0365_targets_and_locks.rs | Adds regression targets/locks for reading order + RTL behavior. |
| tests/seg_ko_functional.rs | Adds end-to-end Korean segmentation regression tests. |
| tests/seg_indic_functional.rs | Adds end-to-end Indic punctuation hugging regression tests. |
| tests/rw1e_form_bbox_clip.rs | Adds regression tests for Form XObject /BBox clipping in extraction. |
| tests/issue_734_two_column_text_order.rs | Adds integration tests for two-column reading order + structured regions. |
| tests/issue_734_tagged_structure.rs | Adds tests for tagged structure surfacing (Lbl, section continuity). |
| tests/issue_458_article_threads.rs | Adds tests for parsing/applying /Threads bead ordering. |
| tests/full_width_header_columns_md.rs | Updates prior ignore/commentary to reflect #734 fix being in place. |
| tests/core_parity.rs | Updates version-parity assertion to 0.3.65. |
| src/text/rtl_detector.rs | Adds Arabic right-joining letter classifier + tests. |
| src/text/bidi.rs | Adds digit-preserving RTL reversal helper + tests. |
| src/structured.rs | Adds section_id, column mode override API, and improved gutter detection/grouping. |
| src/structure/traversal.rs | Threads section_id through structure traversal output. |
| src/pipeline/reading_order/xycut.rs | Replaces O(k²) clustering logic with sort + window counting. |
| src/pipeline/reading_order/mod.rs | Adds preserve_input_order to avoid re-sorting already-reordered spans. |
| src/pipeline/page_order.rs | Wires article-thread bead ordering into default reading-order path with gates. |
| src/pipeline/mod.rs | Implements preserve-input-order fast path in the text pipeline. |
| src/lib.rs | Exports ColumnMode in the public API. |
| src/extractors/text.rs | Improves RTL handling, fixes Tw application for multibyte codes, adds Form /BBox clip logic. |
| src/extractors/ccitt_bilevel.rs | Uses in-house CCITT G4 decoder with fallback + changes bilevel helper visibility. |
| src/document.rs | Adds two-column prose reorder, sidebar-body reorder, RTL fixes, segmentation fixes, and perf wins. |
| src/decoders/mod.rs | Exposes ccitt module internally for decoder use. |
| src/decoders/ccitt.rs | Replaces pass-through-only CCITT module docs with in-house G4 decoder + tests. |
| src/converters/xlsx_layout.rs | Removes O(n²) hyphen-merge by using single-pass accumulator. |
| src/converters/pptx_layout.rs | Removes O(n²) hyphen-merge by using single-pass accumulator. |
| src/converters/pdf_to_ir.rs | Avoids quadratic rotated-char filtering when no rotated chars exist. |
| src/converters/docx_layout.rs | Removes O(n²) hyphen-merge by using single-pass accumulator. |
| ruby/spec/core_parity_spec.rb | Updates Ruby binding version parity check to 0.3.65. |
| ruby/spec/cdylib_smoke_spec.rb | Updates Ruby cdylib smoke version check to 0.3.65. |
| ruby/lib/pdf_oxide/version.rb | Bumps Ruby gem version constant to 0.3.65. |
| python/tests/test_core_parity.py | Updates Python binding version parity check to 0.3.65. |
| pyproject.toml | Bumps Python package version to 0.3.65. |
| php/tests/Integration/CoreParityTest.php | Updates PHP binding version parity check to 0.3.65. |
| php/src/Pdf.php | Bumps PHP version constant to 0.3.65. |
| php/scripts/download-native-lib.php | Bumps PHP installer defaults/user-agent to v0.3.65. |
| pdf_oxide_mcp/src/protocol.rs | Adds MCP schema option column_mode for structured extraction. |
| pdf_oxide_mcp/src/extract.rs | Implements MCP column_mode validation and passes it to structured extraction. |
| pdf_oxide_mcp/Cargo.toml | Bumps MCP crate + dependency version to 0.3.65. |
| pdf_oxide_jni/Cargo.toml | Bumps JNI crate + dependency version to 0.3.65. |
| pdf_oxide_cli/tests/structured_format.rs | Adds CLI tests for --column-mode override semantics. |
| pdf_oxide_cli/src/cli/repl.rs | Plumbs default column_mode=auto through REPL text command. |
| pdf_oxide_cli/src/cli/mod.rs | Plumbs column_mode through CLI dispatch and stdin path. |
| pdf_oxide_cli/src/cli/commands/text.rs | Implements --column-mode mapping for structured extraction. |
| pdf_oxide_cli/src/cli/args.rs | Adds --column-mode CLI option for structured format. |
| pdf_oxide_cli/Cargo.toml | Bumps CLI crate + dependency version to 0.3.65. |
| js/package.json | Bumps JS package version to 0.3.65. |
| java/pom.xml | Bumps Java binding version/tag to 0.3.65. |
| go/cmd/install/main.go | Bumps Go installer fallback version to 0.3.65. |
| csharp/PdfOxide/PdfOxide.csproj | Bumps NuGet package version to 0.3.65. |
| README.md | Updates Maven dependency snippet version to 0.3.65. |
| Cargo.toml | Bumps core crate version to 0.3.65. |
| CHANGELOG.md | Adds v0.3.65 changelog entry detailing features/fixes/perf. |
Comments suppressed due to low confidence (6)
src/extractors/ccitt_bilevel.rs:1
- The blank-image fallback size is now derived from `params.rows`, but `params.rows` can be `None` even when the caller provided `height_opt` (e.g., image dictionary has Height but DecodeParms omits Rows). This can under-allocate the output (often to 1 row), producing an invalid/incorrect image buffer. Use `height_opt` as the primary fallback height (as before), and only fall back to `params.rows` (or vice versa) in a way that preserves the known image height.
src/pipeline/page_order.rs:1
- `parse_article_threads(doc)` is invoked inside `page_article_bead_rects`, which is called per page in the default reading-order path. If thread parsing walks a significant portion of the document structure, this can become an avoidable O(pages × parse_cost) overhead. Consider caching parsed threads at the `PdfDocument` level (or in a per-document lazy cache) and reusing them across pages, then filtering to the current page's beads.
src/pipeline/page_order.rs:1
- The order-divergence gate compares only `x` and `y` when checking whether bead order matches geometric order. If two beads share the same `(x, y)` but differ in `width`/`height` (or if floating-point representation differs slightly), the gate can incorrectly treat distinct beads as equal and suppress thread application. Compare the full rectangle identity (x, y, width, height) with exact or tolerant comparison, or use a stable bead identifier if available.
src/document.rs:1
- Switching `prev_span` from `Option<&TextSpan>` to `Option<TextSpan>` forces a full `TextSpan` clone for every emitted span. `TextSpan` can carry sizable allocations (e.g., `text`, `char_widths`), so this can materially increase CPU and memory pressure on large pages. If the borrow-checker constraint is the need to persist `prev_span` across MCID iterations where spans may come from a temporary vector, consider storing only the fields needed for spacing/line-break decisions (bbox, font_size, and a small amount of text-derived info), or cloning a lightweight struct rather than the full span.
The v0.3.65 batch landed without a full local CI pass (shared-box build contention); this clears every failure the PR surfaced.
- rustfmt: format ccitt.rs, document.rs, ccitt_bilevel.rs, text.rs, rtl_detector.rs, and the v0365 lock test (committed unformatted).
- clippy (-D warnings): manual_range_contains, manual_clamp, useless_conversion, useless_vec (document.rs); manual_is_multiple_of (ccitt.rs); doc_lazy_continuation x2 (structured.rs); unnecessary_map_or -> is_some_and (text.rs RW-1e gate).
- rw1_full_width_title_reads_contiguously: the D3 sidebar-segregation rewrite added an anti-false-positive gate requiring >=2 DISTINCT furniture labels, but the fixture's sidebar was 8x 'Citation' (one distinct label) and the title words were spread with gaps that split at the gutter, so sidebar_body_reading_order never engaged. Make the fixture realistic: a single full-width title run plus eight distinct furniture labels (Citation/Received/Accepted/...), keeping the span count above the classifier's 30-span floor. Locks the real behaviour; zero production code change.
cargo doc --no-deps -D warnings (stable) rejected two intra-doc links from public items to non-public targets:
- CcittFaxDecoder doc linked [`decode`] (a pub(crate) fn): private-intra-doc-links
- ColumnMode doc linked [`build_structured_page_with_mode`] (pub(crate) fn): broken-intra-doc-links (out of public doc scope)
Demote both to plain code spans; they were descriptive, not navigational.
v0.3.65 — Multilingual & layout extraction quality
Right-to-left bidi reconstruction (Arabic/Hebrew), multi-region reading order (publisher sidebars + two-column academic), CJK/Indic word segmentation, an in-house CCITT Group 4 fax decoder, structured two-column surfacing, and a batch of O(n²) hot-path removals.
Added
- `column_index`, `Lbl`→marginal label, `Sect`/`Art`/`Part`→`section_id` (ISO 32000-1 §14.8.4).
- `/Threads`→`/B` bead ordering for content that flows across columns/pages.
- `EncodedByteAlign`, partial-row recovery; replaces a silent all-white fallback.
Fixed
- الثدييات), RTL number preservation (٤٣٤١→١٤٣٤, ל ,2009-→ל-2009,), glyph-advance preservation on `/ReversedChars` producers.
- `/BBox` clip (gated to figure-sized forms so full-page wrappers keep their body).
Changed
- `apply_pending_clip`: sub-perceptual pixel change.
Verification
- `google_doc_document.pdf` table GUARD byte-identical.
Contributors
Thanks @RayVR (#654), @potatochipcoconut (#738), @lggcs (#734).
Closes #738
Closes #734
Closes #458
Ref: #654