release: v0.3.64 #735
Merged
…mments
- Add `fail-fast: false` to the CLI Build and MCP Build matrices so a transient single-OS hiccup (e.g. a Windows rust-cache tar/zstd save failure) no longer cancels the other OSes' builds.
- Correct the stale codecov-action version comments (# v6, # v5.5.4) to # v7.0.0 to match the already-pinned release SHA.
A hyphen at the end of a wrapped line is an incidental layout artifact
(ISO 32000-1:2008 14.8.2.2.3), so the line seam must not gain a separating
space. Previously only a lowercase continuation suppressed the space, so a
capitalised continuation ("sub-" / "Neptune") produced "sub- Neptune".
Now a line-end alphabetic hyphen never inserts a space; the soft-hyphen
case (lowercase continuation) additionally drops the trailing hyphen.
Unit: merge_lines_tests.
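
The seam rule above can be sketched as follows. This is only an illustration under stated assumptions: `merge_wrapped_lines` is a hypothetical name, not the crate's actual function, and it handles a single pair of lines rather than a whole block.

```rust
/// Hypothetical helper (not the project's real API): joins two wrapped lines
/// using the line-end hyphen rule described in the commit message.
fn merge_wrapped_lines(prev: &str, next: &str) -> String {
    let prev = prev.trim_end();
    let next = next.trim_start();
    if prev.is_empty() || next.is_empty() {
        return format!("{prev}{next}");
    }

    // The hyphen counts only when it directly follows a letter.
    let mut rev = prev.chars().rev();
    let ends_with_alpha_hyphen =
        rev.next() == Some('-') && rev.next().is_some_and(char::is_alphabetic);

    if !ends_with_alpha_hyphen {
        // Ordinary seam: the line break stands for a word break.
        return format!("{prev} {next}");
    }

    if next.chars().next().is_some_and(char::is_lowercase) {
        // Soft-hyphen case: drop the trailing hyphen and join the halves.
        let stem = &prev[..prev.len() - 1];
        format!("{stem}{next}")
    } else {
        // Capitalised (or other) continuation: keep the hyphen, add no space.
        format!("{prev}{next}")
    }
}
```

With this sketch, `merge_wrapped_lines("sub-", "Neptune")` yields `sub-Neptune`, while `merge_wrapped_lines("exam-", "ple")` yields `example`.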
Brackets hug their content in every script, so a space between a CJK or
Hangul character and an adjacent ( ) [ ] { } is a layout artifact, not a
word break (e.g. Korean "고양이 (학명 ... 카투스 [*]) 는" should read
"고양이(학명 ... 카투스[*])는"). The ambiguous Korean digit boundary is
left untouched, as before.
wiki-cat-ko text CER 0.0541 -> 0.0203, word_jaccard 0.488 -> 0.727; no CJK
gold spaces a bracket, so the change cannot diverge from ground truth.
Unit: strip_cjk_digit_boundary_spaces.
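
A minimal sketch of the bracket rule, assuming a simplified notion of "CJK or Hangul" (a handful of Unicode blocks). The function names and the block list are illustrative assumptions, not the crate's implementation, and the digit boundary is deliberately not handled here.

```rust
/// Simplified script test (assumption): covers only a few common blocks.
fn is_cjk_or_hangul(c: char) -> bool {
    matches!(c,
        '\u{3040}'..='\u{30FF}'   // Hiragana and Katakana
        | '\u{3400}'..='\u{4DBF}' // CJK Unified Ideographs Extension A
        | '\u{4E00}'..='\u{9FFF}' // CJK Unified Ideographs
        | '\u{1100}'..='\u{11FF}' // Hangul Jamo
        | '\u{AC00}'..='\u{D7AF}' // Hangul Syllables
    )
}

fn is_bracket(c: char) -> bool {
    matches!(c, '(' | ')' | '[' | ']' | '{' | '}')
}

/// Hypothetical helper: drops a single space that sits between a CJK/Hangul
/// character and an adjacent bracket, in either order.
fn strip_cjk_bracket_spaces(text: &str) -> String {
    let chars: Vec<char> = text.chars().collect();
    let mut out = String::with_capacity(text.len());
    for (i, &c) in chars.iter().enumerate() {
        if c == ' ' && i > 0 && i + 1 < chars.len() {
            let (prev, next) = (chars[i - 1], chars[i + 1]);
            let layout_artifact = (is_cjk_or_hangul(prev) && is_bracket(next))
                || (is_bracket(prev) && is_cjk_or_hangul(next));
            if layout_artifact {
                continue; // skip the space
            }
        }
        out.push(c);
    }
    out
}
```

Latin text such as `foo (bar) baz` passes through unchanged, because neither neighbour of the space is a CJK or Hangul character.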
A Markdown header row is rendered bold by readers via the |---| separator
beneath it, so explicit ** in header cells is redundant and diverges from
the conventional rendering ("| Region |", not "| **Region** |").
Suppress bold in the header row only when the table has data rows beneath
it; a single-row table is all header, so its emphasis is real content and
is kept. Data cells always keep their emphasis.
table-bordered md CER 0.0351; cleans header bold across corpus table docs
(IRS forms, CFR, etc.). Full markdown lib suite green.
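
The header rule can be illustrated with a small sketch. `strip_bold` and `render_table` are hypothetical names, the row representation is an assumption, and emitting a separator under a single-row table is a choice made for this example only.

```rust
/// Hypothetical helper: removes one pair of wrapping `**` markers, if present.
fn strip_bold(cell: &str) -> &str {
    let t = cell.trim();
    t.strip_prefix("**")
        .and_then(|s| s.strip_suffix("**"))
        .filter(|s| !s.is_empty())
        .unwrap_or(t)
}

/// Hypothetical renderer: the first row is the header; bold is suppressed in
/// it only when at least one data row follows. Data cells are left untouched.
fn render_table(rows: &[Vec<&str>]) -> String {
    let Some((header, data)) = rows.split_first() else {
        return String::new();
    };
    let has_data_rows = !data.is_empty();

    let header_cells: Vec<&str> = header
        .iter()
        .map(|&c| if has_data_rows { strip_bold(c) } else { c.trim() })
        .collect();

    let mut out = String::new();
    out.push_str(&format!("| {} |\n", header_cells.join(" | ")));
    out.push_str(&format!("|{}\n", "---|".repeat(header.len())));
    for row in data {
        out.push_str(&format!("| {} |\n", row.join(" | ")));
    }
    out
}
```

A two-row table with a `**Region**` header cell renders as `| Region |`, whereas a table consisting of that single row keeps `| **Region** |`.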
Composite (Type 0) fonts that reference a CJK glyph collection but embed no glyph program (e.g. the predefined Adobe CIDFonts Ryumin-Light, GothicBBB-Medium, STSong-Light) rendered blank on hosts with no system CJK font installed: text extraction works (ToUnicode/CID maps are correct), but the renderer's .notdef fallback bottomed out at a generic non-CJK SansSerif face. Register the bundled Droid Sans Fallback in the renderer's font database under the new opt-in `cjk-render-fallback` feature. The existing fallback resolver already queries the family name "Droid Sans Fallback", so it is picked up as the guaranteed last-resort CJK face with no paint-path changes. System CJK fonts (Noto, SimSun, ...) still take precedence, so quality is unchanged where they exist. ISO 32000-2 §9.7.5.2 requires a processor to support the Adobe predefined character collections. Opt-in (~4 MB), so default rendering builds stay slim. Closes #727.
The main `test` job runs default features only, so every `#[cfg(feature = "rendering")]` test file compiled to zero tests and rendering regressions could land undetected. Add a dedicated job running `cargo test --features rendering,test-support,cjk-render-fallback`, which exercises the previously-dead rendering test tier (plus the bundled CJK fallback). Caps parallel link workers (CARGO_BUILD_JOBS=2) to avoid rust-lld link exhaustion on constrained runners. Closes #711.
Adds a synthetic regression test for composite (Type 0 / Identity-H / CIDFontType2) subset fonts whose content-stream codes are a constant offset from their Unicode values, recoverable only via the font's ToUnicode CMap (ISO 32000-1 §9.10.2 gives ToUnicode highest priority and excepts Identity-H from the predefined-CMap fallback). A positive case asserts the heading decodes correctly; a control with the CMap removed proves the fixture genuinely depends on it. Refs #676.
…test
- dump_font: print a font's decode-chain keys (Subtype, Encoding + Differences, ToUnicode presence, embedded FontFile*) for diagnosis.
- dump_page_spans: dump extracted text spans with bounding boxes.
- extract_markdown: print a page's Markdown conversion.
- full_width_header_columns_md: ignored roadmap test asserting a full-width header band is not sliced by a two-column cut (pending the column-region layout model).
Completes the proposal-2 retry coverage for the last network-dependent setup steps that lacked it (proposals 1-3 shipped earlier):
- macOS `brew install cmake nasm go` (FIPS CI + release jobs) now retries up to 3x with backoff. Homebrew on hosted macOS runners intermittently fails the formula download on transient DNS/network blips, the exact flake class this issue tracks.
- The crates.io "already published?" check in the publish job gains `curl --retry 3 --retry-delay 2`. Deliberately no `--retry-all-errors`, so a genuine 404 (not-yet-published) still falls straight through to the publish branch without delay; only transient network/DNS/5xx blips retry.
Proposal 4 (pin macOS runner image) remains intentionally deferred and proposal 5 (auto-rerun watchdog) intentionally rejected per the issue.
Closes #544.
# Conflicts: # Cargo.toml
Two spec-grounded fixes to inline math/chemistry text extraction (ISO 32000-1), each with a unit test and an in-code fixture integration test:
- Prime-notation numbers no longer gain a spurious word break: a glyph's metric advance (w0, §9.4.4) is narrow relative to a prime's inked form, so the geometric heuristic split "0''.28" into "0'' .28". Add strip_prime_decimal_boundary_spaces, applied alongside the CJK strip, dropping the space at prime/decimal boundaries while leaving genuine feet-and-inches (5' 6") intact.
- Signed unit exponents (s-1, m-2) stay ASCII: the super/subscript pass synthesized Unicode sub/superscripts from geometry, overriding ToUnicode (the authoritative source, §9.10) and firing inconsistently. Skip the substitution when the run is a signed number (leading minus + digit).
Also document, as an ignored reproducer, the in-prose subscript-float limitation (NH3 inside a sentence): the span-level merge appends to the base span end and cannot place a subscript whose base letter is interior to an assembled line span. The correct fix binds the subscript to its base before line assembly and must be validated against the full corpus; gate tweaks regress it.
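
Both fixes can be sketched in a few lines. The function below is not the crate's strip_prime_decimal_boundary_spaces: it assumes the boundary is "prime mark, space, decimal point, digit", which is one reading of the description, and `is_signed_number` is likewise a hypothetical stand-in for the guard on the super/subscript pass.

```rust
/// Prime marks considered by this sketch (ASCII quotes plus U+2032/U+2033).
fn is_prime(c: char) -> bool {
    matches!(c, '\'' | '"' | '\u{2032}' | '\u{2033}')
}

/// Hypothetical sketch: drops a space that follows a prime mark and precedes
/// a decimal point plus digit, e.g. "0'' .28" becomes "0''.28".
fn strip_prime_decimal_spaces(text: &str) -> String {
    let chars: Vec<char> = text.chars().collect();
    let mut out = String::with_capacity(text.len());
    for (i, &c) in chars.iter().enumerate() {
        let spurious = c == ' '
            && i > 0
            && is_prime(chars[i - 1])
            && chars.get(i + 1) == Some(&'.')
            && chars.get(i + 2).is_some_and(|d| d.is_ascii_digit());
        if !spurious {
            out.push(c);
        }
    }
    out
}

/// Hypothetical guard: true when a run is a leading minus followed only by
/// digits, in which case the exponent would stay as extracted (s-1, m-2).
fn is_signed_number(run: &str) -> bool {
    let Some(rest) = run
        .strip_prefix('-')
        .or_else(|| run.strip_prefix('\u{2212}'))
    else {
        return false;
    };
    !rest.is_empty() && rest.chars().all(|c| c.is_ascii_digit())
}
```

Feet-and-inches text such as `5' 6"` is untouched by this sketch because the space is followed by a digit rather than a decimal point.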
Bump version 0.3.63 -> 0.3.64 across every binding (Rust workspace + Cargo.lock, Python, Node/WASM, Go, Java, Ruby, PHP, C#) and the version parity tests, and finalize the 0.3.64 CHANGELOG.
Release highlights: composite-CJK page rendering (bundled Droid Sans fallback for embed-less Type 0 fonts and Adobe predefined CIDFonts), §11 transparency surface with optional lcms2 colour management, cross-document font-cache /ToUnicode correctness, valid annotation appearance streams, and math/CJK text-extraction polish.
Validated by a 423-PDF v0.3.63->v0.3.64 space-aware regression sweep (word-Jaccard + content-ratio): zero regressions, no perf regression.
Closes #727
Closes #711
Closes #544
OSV-Scanner flagged RUSTSEC-2026-0176 — an out-of-bounds read in pyo3's optimized `nth`/`nth_back` for `PyList`/`PyTuple` iterators (unchecked `index + n`), fixed in pyo3 0.29.0. Not reachable in our binding: it only builds lists via `PyList::empty` + `append` and never calls `nth`/`nth_back` on a Python list/tuple iterator with a caller-controlled index. The upstream fix (pyo3 0.29) is not yet adoptable because pyo3-log (latest 0.13.3) requires pyo3 `>=0.26, <0.29`, so taking it would drop the Rust->Python logging bridge. Time-boxed ignore (review 2026-09-10); bump pyo3 + pyo3-log together once pyo3-log adds 0.29 support. Not a regression from this PR — pyo3 0.28.3 is the pre-existing pin on main.
OSV reports exactly two advisories for pyo3 0.28.3: RUSTSEC-2026-0176 (already ignored) and RUSTSEC-2026-0177 — a missing Sync bound on PyCFunction::new_closure closures. We never call new_closure/new_closure_bound (all Python callables are #[pyfunction]/#[pymethods]), so the unsoundness is unreachable. Same pyo3 0.29 fix, same pyo3-log <0.29 blocker as 0176.
…substitution
The cjk-render-fallback substitution path resolved a CIDFont's CID to a Unicode point only via the Adobe character-collection table (CharacterCollection::cid_to_unicode). That is correct for a real predefined CIDFont (Ryumin-Light, …) whose CIDs ARE the collection's CIDs, but wrong for an Identity-encoded subset whose arbitrary CIDs only resolve through the document's /ToUnicode CMap: there, CID 1 is not collection CID 1, so the substitution painted the wrong glyph (often a blank space) and the page came out empty.
Resolve CID -> Unicode by /ToUnicode first (authoritative per ISO 32000-1 §9.10.2, filtering U+FFFD/FFFE/FFFF placeholders), then fall back to the Adobe collection table when the font ships no /ToUnicode (the common case for the predefined CIDFonts the substitution targets).
Fixes the previously-never-run-in-CI render test surfaced by the new rendering tier (#711); does not regress the predefined-collection path (#730 tests pass).
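
The resolution order can be sketched as below. The maps are toy stand-ins (assumptions) for the parsed /ToUnicode CMap and the Adobe collection table, and `resolve_cid` is a hypothetical name. The sketch also assumes that a font shipping a /ToUnicode never consults the collection table, which is one reading of the description.

```rust
use std::collections::HashMap;

/// Placeholder code points that carry no usable mapping.
fn is_placeholder(c: char) -> bool {
    matches!(c, '\u{FFFD}' | '\u{FFFE}' | '\u{FFFF}')
}

/// Hypothetical resolver: /ToUnicode first, collection table only when the
/// font has no /ToUnicode at all.
fn resolve_cid(
    cid: u16,
    to_unicode: Option<&HashMap<u16, char>>,
    collection: &HashMap<u16, char>,
) -> Option<char> {
    match to_unicode {
        // Identity-encoded subset: CIDs are arbitrary, so only the
        // document's own map is trusted; placeholders are filtered out.
        Some(map) => map.get(&cid).copied().filter(|c| !is_placeholder(*c)),
        // Predefined CIDFont without /ToUnicode: CIDs are collection CIDs.
        None => collection.get(&cid).copied(),
    }
}
```

Returning `None` for a placeholder leaves the choice of what to paint (for instance, nothing) to the caller rather than guessing a glyph from the collection table.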
The Security Audit (cargo-audit) and Dependency Check (cargo-deny) gates read their own ignore lists (.cargo/audit.toml / deny.toml), not the osv-scanner.toml used by the OSV-Scanner gate. Add the same two pyo3 0.28.x advisories (RUSTSEC-2026-0176 OOB read in nth/nth_back; RUSTSEC-2026-0177 missing Sync on PyCFunction::new_closure) to both, with the same justification: vulnerable APIs unused in our binding, fix (pyo3 0.29) blocked by pyo3-log capping at pyo3 <0.29. Review 2026-09-10.
Release v0.3.64
Composite-CJK page rendering — a bundled Droid Sans fallback now paints embed-less Type 0 fonts and the Adobe predefined CIDFont collections — plus a §11 transparency compositing surface with optional lcms2 colour management, cross-document font-cache correctness, valid annotation appearance streams, and math/CJK text-extraction polish (prime-notation spacing, signed unit exponents, CJK bracket spacing, table-header Markdown).
Added
- `cjk-render-fallback` (feat(render): substitute Adobe predefined CIDFonts via bundled CJK fallback #730)
- `IccBackend` with optional `lcms2` backend (feat(rendering): §11 transparency surface + IccBackend trait with optional lcms2 backend #674)
Fixed
- `/ToUnicode` content (fix(fonts): include /ToUnicode content in the cross-document font-cache key #733)
- `/AP` appearance streams emitted as indirect objects (fix(writer): emit annotation /AP appearance streams as indirect objects #713)
- `.pyi` stubs no longer leak the pyo3 `Py<Self>` receiver (fix(python): drop leaked Py<Self> receiver params from generated .pyi #728)
Validation
- ¬/:-for-decimal class). No perf regression: 1000+ page CFR docs extract text in ~3s, full text+md+html in ~26s, baseline ≈ candidate.
- `cargo fmt --check`, `cargo clippy --all-targets --workspace -D warnings`, `ruff check` / `ruff format --check`, `rubocop`, `phpstan`, `composer validate`, `gofmt`, changelog release-notes check.
Closes
Closes #727
Closes #711
Closes #544
Refs (shipped this release, merged via main)
#730, #674, #733, #713, #728, #729, #712, #731, #676
Thanks
- `/ToUnicode` key collision (fix(fonts): include /ToUnicode content in the cross-document font-cache key #733)
- `/AP` streams, pyo3 `.pyi` receiver, UTF-16 bookmark decoding (fix(writer): emit annotation /AP appearance streams as indirect objects #713, fix(python): drop leaked Py<Self> receiver params from generated .pyi #728, fix(parsing): support bookmark titles encoded in UTF-16BE/LE #729)