release: v0.3.64 #735

Merged
yfedoseev merged 16 commits into main from release/v0.3.64 on Jun 12, 2026

@yfedoseev

Owner

Release v0.3.64

This release adds composite-CJK page rendering: a bundled Droid Sans Fallback face now paints embed-less Type 0 fonts and the Adobe predefined CIDFont collections. It also brings a §11 transparency compositing surface with optional lcms2 colour management, cross-document font-cache correctness, valid annotation appearance streams, and math/CJK text-extraction polish (prime-notation spacing, signed unit exponents, CJK bracket spacing, table-header Markdown).

Added

Fixed

Validation

  • v0.3.63 → v0.3.64 space-aware regression sweep over 423 PDFs (word-Jaccard + content-ratio, across math/CJK/academic/government/forms/mixed/news/theses/technical/multi-column/headers): 0 regressions — every diff is a net improvement, including recovered chart-number decoding (the CMSY ¬/:-for-decimal class). No perf regression: 1000+ page CFR docs extract text in ~3s, full text+md+html in ~26s, baseline ≈ candidate.
  • Local gates green: cargo fmt --check, cargo clippy --all-targets --workspace -D warnings, ruff check/ruff format --check, rubocop, phpstan, composer validate, gofmt, changelog release-notes check.
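For readers unfamiliar with the word-Jaccard metric named in the sweep, here is a minimal sketch of how such a score can be computed. The function name and the empty-input convention are assumptions for illustration; this is not the sweep's actual tooling.

```rust
use std::collections::HashSet;

/// Word-level Jaccard similarity: |A ∩ B| / |A ∪ B| over whitespace-split words.
/// Assumption: two empty texts count as identical (score 1.0).
fn word_jaccard(a: &str, b: &str) -> f64 {
    let words_a: HashSet<&str> = a.split_whitespace().collect();
    let words_b: HashSet<&str> = b.split_whitespace().collect();
    let union = words_a.union(&words_b).count();
    if union == 0 {
        return 1.0;
    }
    let intersection = words_a.intersection(&words_b).count();
    intersection as f64 / union as f64
}
```

Because the score is computed over whitespace-split words, a spurious or missing space changes the word set, which is what makes the comparison space-aware. For example, `word_jaccard("a b c", "b c d")` shares two of four distinct words and yields 0.5.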

Closes

Closes #727
Closes #711
Closes #544

Refs (shipped this release, merged via main)

#730, #674, #733, #713, #728, #729, #712, #731, #676

Thanks

yfedoseev added 16 commits June 9, 2026 20:36
…mments

- Add fail-fast: false to the CLI Build and MCP Build matrices so a
  transient single-OS hiccup (e.g. a Windows rust-cache tar/zstd save
  failure) no longer cancels the other OSes' builds.
- Correct the stale codecov-action version comments (# v6, # v5.5.4) to
  # v7.0.0 to match the already-pinned release SHA.

A hyphen at the end of a wrapped line is an incidental layout artifact
(ISO 32000-1:2008 14.8.2.2.3), so the line seam must not gain a separating
space. Previously only a lowercase continuation suppressed the space, so a
capitalised continuation ("sub-" / "Neptune") produced "sub- Neptune".
Now a line-end alphabetic hyphen never inserts a space; the soft-hyphen
case (lowercase continuation) additionally drops the trailing hyphen.

Unit: merge_lines_tests.
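
The rule described above can be sketched as follows. This is an illustrative reconstruction from the commit message, not the code in the repository; `join_wrapped_lines` is a hypothetical name.

```rust
/// Join two consecutive lines of a wrapped paragraph (hypothetical helper).
fn join_wrapped_lines(prev: &str, next: &str) -> String {
    let prev = prev.trim_end();
    let next = next.trim_start();
    if let Some(stem) = prev.strip_suffix('-') {
        // Only a hyphen that directly follows a letter counts as "alphabetic".
        let alphabetic_hyphen = stem.chars().last().map_or(false, |c| c.is_alphabetic());
        if alphabetic_hyphen {
            let lowercase_continuation =
                next.chars().next().map_or(false, |c| c.is_lowercase());
            if lowercase_continuation {
                // Soft-hyphen case: drop the hyphen and rejoin the word.
                return format!("{}{}", stem, next);
            }
            // Any other continuation: keep the hyphen, insert no space.
            return format!("{}{}", prev, next);
        }
    }
    // Ordinary line seam: separate with a single space.
    format!("{} {}", prev, next)
}
```

With this sketch, `join_wrapped_lines("sub-", "Neptune")` gives "sub-Neptune" and `join_wrapped_lines("implemen-", "tation")` gives "implementation".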

Brackets hug their content in every script, so a space between a CJK or
Hangul character and an adjacent ( ) [ ] { } is a layout artifact, not a
word break (e.g. Korean "고양이 (학명 ... 카투스 [*]) 는" should read
"고양이(학명 ... 카투스[*])는"). The ambiguous Korean digit boundary is
left untouched, as before.

On wiki-cat-ko, text CER improves 0.0541 -> 0.0203 and word_jaccard
0.488 -> 0.727. No CJK gold text puts a space next to a bracket, so the
change cannot diverge from ground truth.
Unit: strip_cjk_digit_boundary_spaces.
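
A minimal sketch of the bracket rule, assuming a simplified set of Unicode ranges for "CJK or Hangul". The function names and the exact ranges are illustrative assumptions, not the library's implementation.

```rust
/// Assumed, simplified character classes; the real pass may cover more blocks.
fn is_cjk_or_hangul(c: char) -> bool {
    matches!(c,
        '\u{1100}'..='\u{11FF}'   // Hangul Jamo
        | '\u{3040}'..='\u{30FF}' // Hiragana and Katakana
        | '\u{3400}'..='\u{4DBF}' // CJK Unified Ideographs Extension A
        | '\u{4E00}'..='\u{9FFF}' // CJK Unified Ideographs
        | '\u{AC00}'..='\u{D7A3}' // Hangul Syllables
    )
}

fn is_bracket(c: char) -> bool {
    matches!(c, '(' | ')' | '[' | ']' | '{' | '}')
}

/// Drop a space that sits between a CJK/Hangul character and a bracket.
/// Spaces next to digits or Latin text are left untouched.
fn strip_cjk_bracket_spaces(text: &str) -> String {
    let chars: Vec<char> = text.chars().collect();
    let mut out = String::with_capacity(text.len());
    for (i, &c) in chars.iter().enumerate() {
        if c == ' ' && i > 0 && i + 1 < chars.len() {
            let (prev, next) = (chars[i - 1], chars[i + 1]);
            let artifact = (is_cjk_or_hangul(prev) && is_bracket(next))
                || (is_bracket(prev) && is_cjk_or_hangul(next));
            if artifact {
                continue;
            }
        }
        out.push(c);
    }
    out
}
```

Applied to the Korean example from the commit message, the sketch turns "고양이 (학명 ... 카투스 [*]) 는" into "고양이(학명 ... 카투스[*])는", while Latin text such as "cat (Felis catus)" is unchanged.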

A Markdown header row is rendered bold by readers via the |---| separator
beneath it, so explicit ** in header cells is redundant and diverges from
the conventional rendering ("| Region |", not "| **Region** |").
Suppress bold in the header row only when the table has data rows beneath
it; a single-row table is all header, so its emphasis is real content and
is kept. Data cells always keep their emphasis.

table-bordered md CER 0.0351; cleans header bold across corpus table docs
(IRS forms, CFR, etc.). Full markdown lib suite green.
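
The header rule can be illustrated with a small table renderer. This is a sketch under the assumption that cells arrive as plain strings with literal `**` markers; `strip_bold` and `render_table` are hypothetical names.

```rust
/// Strip one pair of surrounding `**` markers from a cell, if present.
fn strip_bold(cell: &str) -> &str {
    let trimmed = cell.trim();
    trimmed
        .strip_prefix("**")
        .and_then(|rest| rest.strip_suffix("**"))
        .unwrap_or(trimmed)
}

/// Render rows as a Markdown table; the first row is treated as the header.
fn render_table(rows: &[Vec<&str>]) -> String {
    let mut out = String::new();
    // Header bold is redundant only when data rows follow the header.
    let has_data_rows = rows.len() > 1;
    for (i, row) in rows.iter().enumerate() {
        let cells: Vec<&str> = row
            .iter()
            .map(|&cell| {
                if i == 0 && has_data_rows {
                    strip_bold(cell)
                } else {
                    cell.trim()
                }
            })
            .collect();
        out.push_str(&format!("| {} |\n", cells.join(" | ")));
        if i == 0 {
            let separator = vec!["---"; row.len()];
            out.push_str(&format!("| {} |\n", separator.join(" | ")));
        }
    }
    out
}
```

A table with a header and one data row renders its header as "| Region | Total |", while data cells such as "**EU**" keep their emphasis. A single-row table keeps "| **Region** |" because its emphasis is real content.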

Composite (Type 0) fonts that reference a CJK glyph collection but embed
no glyph program (e.g. the predefined Adobe CIDFonts Ryumin-Light,
GothicBBB-Medium, STSong-Light) rendered blank on hosts with no system
CJK font installed: text extraction works (ToUnicode/CID maps are
correct), but the renderer's .notdef fallback bottomed out at a generic
non-CJK SansSerif face.

Register the bundled Droid Sans Fallback in the renderer's font database
under the new opt-in `cjk-render-fallback` feature. The existing fallback
resolver already queries the family name "Droid Sans Fallback", so it is
picked up as the guaranteed last-resort CJK face with no paint-path
changes. System CJK fonts (Noto, SimSun, ...) still take precedence, so
quality is unchanged where they exist. ISO 32000-2 §9.7.5.2 requires a
processor to support the Adobe predefined character collections.

Opt-in (~4 MB), so default rendering builds stay slim.

Closes #727.
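
The precedence described above (system CJK fonts first, bundled face last) can be sketched as a plain lookup. The function, its parameters, and the `bundled_registered` flag standing in for the `cjk-render-fallback` feature are all illustrative assumptions; the real renderer resolves faces through its font database.

```rust
/// Family name the fallback resolver queries, per the commit message.
const BUNDLED_CJK_FAMILY: &str = "Droid Sans Fallback";

/// Pick a CJK face: any installed preferred family wins, and the bundled
/// face is used only as the last resort (hypothetical helper).
fn resolve_cjk_face<'a>(
    preferred: &[&'a str],
    installed: &[&str],
    bundled_registered: bool,
) -> Option<&'a str> {
    for &family in preferred {
        if installed.iter().any(|&name| name == family) {
            return Some(family);
        }
    }
    if bundled_registered {
        Some(BUNDLED_CJK_FAMILY)
    } else {
        // Without the opt-in feature there is no guaranteed CJK face.
        None
    }
}
```

On a host with SimSun installed the sketch returns "SimSun"; on a host with no CJK fonts it returns "Droid Sans Fallback" only when the bundled face is registered.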

The main `test` job runs default features only, so every
`#[cfg(feature = "rendering")]` test file compiled to zero tests and
rendering regressions could land undetected. Add a dedicated job running
`cargo test --features rendering,test-support,cjk-render-fallback`, which
exercises the previously-dead rendering test tier (plus the bundled CJK
fallback). Caps parallel link workers (CARGO_BUILD_JOBS=2) to avoid
rust-lld link exhaustion on constrained runners.

Closes #711.

Adds a synthetic regression test for composite (Type 0 / Identity-H /
CIDFontType2) subset fonts whose content-stream codes are a constant
offset from their Unicode values, recoverable only via the font's
ToUnicode CMap (ISO 32000-1 §9.10.2 gives ToUnicode highest priority and
excepts Identity-H from the predefined-CMap fallback). A positive case
asserts the heading decodes correctly; a control with the CMap removed
proves the fixture genuinely depends on it.

Refs #676.
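
The shape of such a positive/control pair can be sketched without a PDF fixture. Everything here is a toy model: the offset value, the helper names, and the control behaviour (reading raw codes as Unicode values) are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical constant offset between content-stream codes and Unicode values.
const OFFSET: u32 = 29;

/// Encode text the way the synthetic subset font would (inputs must be >= U+001D).
fn encode(text: &str) -> Vec<u16> {
    text.chars().map(|c| (c as u32 - OFFSET) as u16).collect()
}

/// Build a toy ToUnicode map covering exactly the codes in use.
fn build_to_unicode(codes: &[u16]) -> HashMap<u16, char> {
    codes
        .iter()
        .filter_map(|&code| char::from_u32(code as u32 + OFFSET).map(|ch| (code, ch)))
        .collect()
}

/// Decode codes via the map when present; otherwise read each code as Unicode.
fn decode(codes: &[u16], to_unicode: Option<&HashMap<u16, char>>) -> String {
    codes
        .iter()
        .map(|&code| match to_unicode {
            Some(map) => map.get(&code).copied().unwrap_or('\u{FFFD}'),
            // Control case: no CMap, so the offset is never undone.
            None => char::from_u32(code as u32).unwrap_or('\u{FFFD}'),
        })
        .collect()
}
```

Decoding the codes for "Heading" with the map recovers the text, and decoding the same codes without it does not, which is what shows the fixture depends on the CMap.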

…test

- dump_font: print a font's decode-chain keys (Subtype, Encoding +
  Differences, ToUnicode presence, embedded FontFile*) for diagnosis.
- dump_page_spans: dump extracted text spans with bounding boxes.
- extract_markdown: print a page's Markdown conversion.
- full_width_header_columns_md: ignored roadmap test asserting a
  full-width header band is not sliced by a two-column cut (pending the
  column-region layout model).

Completes the proposal-2 retry coverage for the last network-dependent
setup steps that lacked it (proposals 1-3 shipped earlier):

- macOS `brew install cmake nasm go` (FIPS CI + release jobs) now retries
  up to 3x with backoff — Homebrew on hosted macOS runners intermittently
  fails the formula download on transient DNS/network blips, the exact
  flake class this issue tracks.
- The crates.io "already published?" check in the publish job gains
  `curl --retry 3 --retry-delay 2`. Deliberately no `--retry-all-errors`,
  so a genuine 404 (not-yet-published) still falls straight through to the
  publish branch without delay; only transient network/DNS/5xx blips retry.

Proposal 4 (pin macOS runner image) remains intentionally deferred and
proposal 5 (auto-rerun watchdog) intentionally rejected per the issue.

Closes #544.
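
The retry policy for the "already published?" check (retry transient failures, let a genuine 404 through immediately) can be modelled as below. The CI itself uses curl flags, so this Rust sketch is only an illustration of the decision logic; `Probe` and `probe_with_retry` are hypothetical names.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Outcome of one "already published?" probe (hypothetical model).
#[derive(Debug, PartialEq)]
enum Probe {
    Found,     // crate version exists: skip publishing
    NotFound,  // genuine 404: go straight to the publish branch
    Transient, // network/DNS/5xx blip: worth retrying
}

/// Re-run `probe` up to `retries` extra times, but only on transient outcomes.
fn probe_with_retry<F: FnMut() -> Probe>(mut probe: F, retries: u32, delay: Duration) -> Probe {
    let mut attempt = 0;
    loop {
        match probe() {
            Probe::Transient if attempt < retries => {
                attempt += 1;
                sleep(delay);
            }
            // Found, NotFound, or retries exhausted: return without delay.
            outcome => return outcome,
        }
    }
}
```

In this model a `NotFound` result returns after a single call, mirroring the choice to leave out `--retry-all-errors`, while two transient failures followed by success cost three calls in total.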

Two spec-grounded fixes to inline math/chemistry text extraction (ISO
32000-1), each with a unit test and an in-code fixture integration test:

- Prime-notation numbers no longer gain a spurious word break: a glyph's
  metric advance (w0, §9.4.4) is narrow relative to a prime's inked form,
  so the geometric heuristic split "0''.28" into "0'' .28". Add
  strip_prime_decimal_boundary_spaces, applied alongside the CJK strip,
  dropping the space at prime/decimal boundaries while leaving genuine
  feet-and-inches (5' 6") intact.

- Signed unit exponents (s-1, m-2) stay ASCII: the super/subscript pass
  synthesized Unicode sub/superscripts from geometry, overriding ToUnicode
  (the authoritative source, §9.10) and firing inconsistently. Skip the
  substitution when the run is a signed number (leading minus + digit).
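
Both checks can be sketched on plain strings. The helper names below are hypothetical and the set of prime marks is an assumption; the real pass works on extracted spans rather than finished text.

```rust
/// Assumed set of prime marks: ASCII quotes plus U+2032 and U+2033.
fn is_prime_mark(c: char) -> bool {
    matches!(c, '\'' | '"' | '\u{2032}' | '\u{2033}')
}

/// Drop a space that separates a prime mark from a following ".<digit>".
fn drop_prime_decimal_space(text: &str) -> String {
    let chars: Vec<char> = text.chars().collect();
    let mut out = String::with_capacity(text.len());
    for (i, &c) in chars.iter().enumerate() {
        if c == ' ' && i > 0 && i + 2 < chars.len() {
            let after_prime = is_prime_mark(chars[i - 1]);
            let before_decimal = chars[i + 1] == '.' && chars[i + 2].is_ascii_digit();
            if after_prime && before_decimal {
                continue;
            }
        }
        out.push(c);
    }
    out
}

/// True when a run starts with a minus sign followed by a digit, e.g. "-1".
/// Accepting U+2212 alongside ASCII '-' is an assumption.
fn is_signed_number(run: &str) -> bool {
    let mut chars = run.chars();
    matches!(chars.next(), Some('-') | Some('\u{2212}'))
        && chars.next().map_or(false, |c| c.is_ascii_digit())
}
```

The first helper rewrites "0'' .28" to "0''.28" but leaves 5' 6" alone, because the character after the space is a digit rather than a decimal point. The second returns true for "-1" and "-2", which is the condition under which the super/subscript substitution would be skipped.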

Also document, as an ignored reproducer, the in-prose subscript-float
limitation (NH3 inside a sentence): the span-level merge appends to the
base span end and cannot place a subscript whose base letter is interior
to an assembled line span. The correct fix binds the subscript to its
base before line assembly and must be validated against the full corpus;
gate tweaks regress it.

Bump version 0.3.63 -> 0.3.64 across every binding (Rust workspace +
Cargo.lock, Python, Node/WASM, Go, Java, Ruby, PHP, C#) and the version
parity tests, and finalize the 0.3.64 CHANGELOG.

Release highlights: composite-CJK page rendering (bundled Droid Sans
fallback for embed-less Type 0 fonts and Adobe predefined CIDFonts),
§11 transparency surface with optional lcms2 colour management,
cross-document font-cache /ToUnicode correctness, valid annotation
appearance streams, and math/CJK text-extraction polish.

Validated by a 423-PDF v0.3.63->v0.3.64 space-aware regression sweep
(word-Jaccard + content-ratio): zero regressions, no perf regression.

Closes #727
Closes #711
Closes #544

OSV-Scanner flagged RUSTSEC-2026-0176, an out-of-bounds read in pyo3's
optimized `nth`/`nth_back` for `PyList`/`PyTuple` iterators (unchecked
`index + n`), fixed in pyo3 0.29.0.

Not reachable in our binding: it only builds lists via `PyList::empty` +
`append` and never calls `nth`/`nth_back` on a Python list/tuple iterator
with a caller-controlled index. The upstream fix (pyo3 0.29) is not yet
adoptable because pyo3-log (latest 0.13.3) requires pyo3 `>=0.26, <0.29`,
so taking it would drop the Rust->Python logging bridge.

Time-boxed ignore (review 2026-09-10); bump pyo3 + pyo3-log together once
pyo3-log adds 0.29 support. Not a regression from this PR — pyo3 0.28.3 is
the pre-existing pin on main.

OSV reports exactly two advisories for pyo3 0.28.3: RUSTSEC-2026-0176
(already ignored) and RUSTSEC-2026-0177 — a missing Sync bound on
PyCFunction::new_closure closures. We never call new_closure/new_closure_bound
(all Python callables are #[pyfunction]/#[pymethods]), so the unsoundness is
unreachable. Same pyo3 0.29 fix, same pyo3-log <0.29 blocker as 0176.

…substitution

The cjk-render-fallback substitution path resolved a CIDFont's CID to a
Unicode point only via the Adobe character-collection table
(CharacterCollection::cid_to_unicode). That is correct for a real
predefined CIDFont (Ryumin-Light, …) whose CIDs ARE the collection's CIDs,
but wrong for an Identity-encoded subset whose arbitrary CIDs only resolve
through the document's /ToUnicode CMap — there CID 1 is not collection CID
1, so the substitution painted the wrong glyph (often a blank space) and
the page came out empty.

Resolve CID -> Unicode by /ToUnicode first (authoritative per ISO 32000-1
§9.10.2, filtering U+FFFD/FFFE/FFFF placeholders), then fall back to the
Adobe collection table when the font ships no /ToUnicode (the common case
for the predefined CIDFonts the substitution targets). Fixes the
previously-never-run-in-CI render test surfaced by the new rendering tier
(#711); does not regress the predefined-collection path (#730 tests pass).
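
The resolution order can be sketched with two lookup tables. This is a simplified model, not the library's code: `resolve_cid` is a hypothetical name, both tables are plain maps, and the choice to return nothing for a filtered placeholder (rather than consult the collection table) is an assumption drawn from the wording above.

```rust
use std::collections::HashMap;

/// Placeholder code points that carry no real text.
fn is_placeholder(c: char) -> bool {
    matches!(c, '\u{FFFD}' | '\u{FFFE}' | '\u{FFFF}')
}

/// Resolve a CID to Unicode: /ToUnicode is authoritative when the font has one;
/// the collection table is used only when the font ships no /ToUnicode.
fn resolve_cid(
    cid: u16,
    to_unicode: Option<&HashMap<u16, char>>,
    collection: &HashMap<u16, char>,
) -> Option<char> {
    match to_unicode {
        Some(map) => map.get(&cid).copied().filter(|&c| !is_placeholder(c)),
        None => collection.get(&cid).copied(),
    }
}
```

For an Identity-encoded subset, CID 1 then resolves through the document's own map instead of being read as collection CID 1, which is the mismatch that produced empty pages.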

The Security Audit (cargo-audit) and Dependency Check (cargo-deny) gates
read their own ignore lists (.cargo/audit.toml / deny.toml), not the
osv-scanner.toml used by the OSV-Scanner gate. Add the same two pyo3 0.28.x
advisories (RUSTSEC-2026-0176 OOB read in nth/nth_back; RUSTSEC-2026-0177
missing Sync on PyCFunction::new_closure) to both, with the same
justification: vulnerable APIs unused in our binding, fix (pyo3 0.29)
blocked by pyo3-log capping at pyo3 <0.29. Review 2026-09-10.

yfedoseev merged commit 8b3c4a8 into main on Jun 12, 2026
241 checks passed
yfedoseev deleted the release/v0.3.64 branch on June 12, 2026 15:44