feat(render): substitute Adobe predefined CIDFonts via bundled CJK fallback#730
Conversation
Adds src/fonts/predefined_cidfont.rs: a curated name->character-collection registry covering the Adobe-Japan1, Adobe-GB1, Adobe-CNS1, and Adobe-Korea1 predefined CIDFonts that ISO 32000-2 §9.7.5.2 mandates conforming readers to support (Ryumin-Light, GothicBBB-Medium, HeiseiMin-W3, HeiseiKakuGo-W5, STSong-Light, STHeiti-Regular, SimSun, SimHei, MHei-Medium, MSung-Light, MingLiU, HYSMyeongJo-Medium, HYGoThic-Medium, Adobe-MyungjoStd-Medium, …). Producers concatenate Adobe predefined CMap names onto the base font name to form Type0 references like Ryumin-Light-Identity-V or STSong-Light-GBK-EUC-H. The matcher strips the longest known CMap suffix (so STSong-Light-GBK-EUC-H does not mis-strip the legacy single-letter -H suffix and leave an unresolvable STSong-Light-GBK-EUC) and a 6-character subset prefix before matching the bare base name. CharacterCollection::cid_to_unicode routes through the existing cid_mappings tables; this is the input the page-render substitution path will feed into Droid Sans Fallback's Unicode cmap. Also introduces the cjk-render-fallback cargo feature that subsequent commits will gate the bundled-font lookup behind. No behaviour change yet. 12 unit tests, all paths covered: bare names, suffixed names, subset prefixes, unrelated fonts, suffix-strip boundary protection.
…d time
Adds FontInfo.cjk_substitution: when from_dict() finishes parsing a Type0
font whose CIDFont descriptor has no /FontFile{,2,3} AND whose BaseFont
matches the Adobe predefined registry (post subset-prefix and CMap-suffix
strip), the field carries the resolved CharacterCollection. Every other
font keeps cjk_substitution: None and the existing decode/render paths
run unchanged.
The detection is pure metadata routing — no render-time behaviour yet.
The renderer consults this flag in a follow-up commit to route glyph
paint through the bundled covering font.
Existing FontInfo construction sites in unit and integration tests get
cjk_substitution: None appended; the field is intentionally non-optional
in the struct literal (no Default impl) so future renderers, extractors,
and writers that materialise FontInfo by hand cannot accidentally omit
the substitution flag and silently regress.
Adds TextRasterizer::render_substituted_cjk, gated on cjk-render-fallback. When render_text observes a FontInfo flagged for CJK substitution at load time (Type0 + no embedded outlines + base name in the Adobe predefined registry), the paint routes through this new method: CID -> Unicode (via the matching cid_mappings table for Adobe-Japan1 / Adobe-GB1 / Adobe-CNS1 / Adobe-Korea1) -> Droid Sans Fallback glyph_id (via the bundled font's Unicode cmap) -> outline paint through the existing tiny-skia path. Advance widths come from the PDF's own /W array so the substituted layout matches the source. Vertical writing mode (text_wmode == 1) advances along y with the same v_x / v_y origin offsets the embedded-CID path uses; the glyph shape itself is the horizontal form of Droid Sans Fallback (vertical variant glyphs would require a separate face we don't ship). When a CID has no Unicode mapping under the resolved collection, or Droid Sans Fallback has no glyph for the resolved Unicode point, the paint is skipped but the cursor still advances so subsequent glyphs land correctly — never a silent character drop without preserving layout. src/fonts/form_fallback.rs gains a feature-gated render_cjk_fallback_bytes accessor so the render path can include_bytes! the same DroidSans asset without coupling to the cjk-form-fonts feature. The two features are intentionally independent: enabling one does not activate the other, so the form-flatten path's behaviour is unchanged when only cjk-form-fonts is on. When cjk-render-fallback is off (the default), all the substitution code compiles out and render_text falls through to the existing routing - documents with predefined CIDFonts still render blank for those fonts (no regression vs. main, but no improvement either).
…xtures
Adds tests/test_cjk_predefined_cidfont_render.rs (gated on rendering +
cjk-render-fallback) plus two real-world fixtures (jo.pdf, kampo.pdf —
both pdfminer.six samples, MIT-licensed) and a synthetic Adobe-Japan1
PDF assembled inline.
Four tests:
1. font_info_flags_predefined_cidfont_for_substitution — synthetic PDF
parses without error, proving the new cjk_substitution field on
FontInfo doesn't break the existing construction paths.
2. synthetic_adobe_japan1_pdf_renders_non_blank — synthetic minimal PDF
with Ryumin-Light Identity-H and two CJK CIDs renders >= 1000
non-white pixels (sensitivity-verified at 7,340 with substitution
vs. 0 without — a clean detection of regression).
3. jo_pdf_vertical_japanese_renders_non_blank — pdfminer.six's classic
vertical-Japanese fixture (Ryumin-Light Identity-V) renders >= 2000
non-white pixels (sensitivity-verified at 119,186 vs. 0 without).
4. kampo_pdf_japanese_pharmacopeia_renders_non_blank — denser real-
world Japanese pharmacopeia text with Ryumin-Light + GothicBBB-Medium
+ HeiseiKakuGo-W5 references plus an embedded Latin Type1C subset.
Renders >= 5000 non-white pixels (sensitivity-verified at 277,397
with substitution vs. 21,212 without — the 256k delta is the CJK
content that was previously blank).
tests/fixtures/vertical_cjk/ATTRIBUTION.md documents fixture provenance
(pdfminer.six MIT corpus) and the Droid Sans Fallback Apache-2.0
license that ships in src/fonts/assets/.
rustdoc with -D warnings rejected the previous module-level doc comments:
- The '[`is_predefined`]' link in the inner //! docstring was resolved
against the parent fonts/mod.rs scope (since rustdoc walks pub mod
docstrings under the parent), where the function doesn't exist. Switch
to plain prose since the function is right below in the same file.
- The '[`super::cid_mappings`]' links failed the same way: rustdoc's
intra-doc resolver doesn't accept super:: in module-level //! comments
from another module's vantage point. Use crate::fonts::cid_mappings
instead — explicit, unambiguous, identical resolved target.
yfedoseev
left a comment
There was a problem hiding this comment.
Review: request changes
Great work — the intent is genuinely spec-motivated (ISO 32000-2 §9.7.5.2 requires Adobe-Japan1/GB1/CNS1/KR support), the BaseFont subset-prefix stripping matches §9.6.4, and substituting a covering font for a non-embedded CIDFont is a legitimate conforming-reader strategy. Registry longest-match / -H/-V boundary protection, feature-gating (clean no-op when the flag is off), and unwrap-safety all check out. Two blockers before merge (see inline comments), plus two lower-priority notes below.
Lower-priority (non-blocking)
Collection chosen from font name, not /CIDSystemInfo (font_dict.rs:1204). The character collection — i.e. which CID→Unicode table is used — is derived purely from the base-font name. The descendant font's /CIDSystemInfo Registry/Ordering is parsed and available in the same function but never consulted. A producer that names a font STSong-Light (→ Adobe-GB1) but pairs it with an Identity-ordered descendant would be interpreted against the wrong table with no diagnostic. Suggest cross-checking cid_system_info.ordering and preferring it (or declining substitution) on mismatch.
Returned advance omits horizontal scaling (Tz/Th) (text_rasterizer.rs, success path vs the bundled-face-missing branch). The success path accumulates x_cursor without h_scale, but the load-failure fallback routes through measure_text_bytes, which does apply h_scale. Per §9.4.4, tx = ((w0·Tfs)+Tc+Tw)·Th, so Th belongs in the returned text-space advance — and the two branches of the same function should agree. (render_cid_direct shares the no-h_scale convention, so this is partly pre-existing, but the new function disagrees with itself.)
Nice touch on the sensitivity-verified tests — toggling the path off and confirming the assertion fails is exactly right. The main gap is that only Identity-H is exercised (see blocker #2).
Three gate fixes for the predefined-CIDFont substitution detection in
FontInfo::from_dict:
- Gate on /FontFile{,2,3} key presence instead of the post-extraction
Option. A descriptor that declares a font program whose stream fails
to load or decode is a decode defect, not a substitution candidate;
substituting would mask the failure and silently degrade output. The
previously swallowed load/decode errors on the Type0 wrapper's
descriptor path are now logged, and a warning names the font when a
declared program could not be extracted.
- Require the /Encoding to resolve to an Identity charcode-to-CID
mapping (Identity-H/V or an Adobe-collection identity CMap stream).
Non-Identity predefined CMaps (90ms-RKSJ-H, GBK-EUC-H, B5pc-H, ...)
put raw Shift-JIS / EUC / Big5 codes in the content stream; indexing
the CID-to-Unicode tables with those values paints wrong glyphs with
no diagnostic. Such fonts are recognised by the name registry but
left unsubstituted until a charcode-to-CID CMap pass is wired.
- Derive the character collection from the descendant's /CIDSystemInfo
Ordering when it names a known collection, falling back to the
name-derived collection for Identity or unknown orderings. The
explicit CIDSystemInfo is authoritative for CID semantics per
ISO 32000-1 §9.7.3; a producer that names a font STSong-Light but
pairs it with a Japan1-ordered descendant now gets the Japan1 table.
parse_descendant_fonts and extract_embedded_font_from_descriptor now
report descriptor font-program key presence alongside the extracted
bytes so the gate can tell "no program" apart from "program present
but failed to decode".
… paint Per ISO 32000-1 §9.4.4 the horizontal displacement is tx = ((w0*Tfs)+Tc+Tw)*Th. render_substituted_cjk deferred Th to the per-glyph paint position but returned the unscaled cursor, so the text matrix advanced as if Tz were 100% while the painted glyphs were scaled - and the same function's bundled-face-missing fallback (measure_text_bytes) does apply Th, so the two branches disagreed. Scale the returned horizontal advance by Th; vertical advances keep no Th factor per the spec. Also document the code == CID invariant at the substitution paint loop: the load-time gate only flags fonts whose /Encoding resolved to Encoding::Identity, so the raw charcode is the CID by construction.
|
@yfedoseev all four findings addressed in
Full suite (8547 tests, |
|
CI note: the single red check, |
yfedoseev
left a comment
There was a problem hiding this comment.
Approve — thanks @RayVR, all four findings are genuinely fixed (verified in code, not just description; tests run green locally). 🙏
Blocker 1 (masked embedded-decode failure) — fixed. Substitution now gates on has_font_program derived from /FontFile{,2,3} key presence across both the wrapper and descendant descriptors (font_dict.rs:1294-1297, merged at :1127), and warns + declines when a declared program key is present but extraction fails (:1338-1344); the previously-swallowed .ok() errors are now surfaced. The ...key_present_but_unextractable test (dangling /FontFile3) pins Some→None.
Blocker 2 (raw char-code as CID for non-Identity CMaps) — fixed. Substitution restricted to Encoding::Identity (font_dict.rs:1297); parse_encoding maps only Identity-H/V there, so 90ms-RKSJ-H/GBK-EUC-H/B5pc-H fall to Standard and are recognized-but-skipped with a diagnostic (:1345-1351). The requires_identity_cmap_encoding test pins Ryumin-Light-90ms-RKSJ-H → None. The safe choice (decline until a charcode→CID pass is wired) is right — partial wiring would mis-key the CID-keyed /W widths too.
Non-blocking (3)(4) — addressed. /CIDSystemInfo Ordering now overrides the name-derived collection on mismatch (logged); horizontal advance scales by Th per §9.4.4 while vertical stays Th-free, matching the measure_text_bytes branch.
pdf.md-aligned: §9.4.4 tx=…×Th vs ty (no Th) (pdf.md:17442-48), §9.6.4 six-uppercase subset prefix (17852), §9.7.5.2 + Table 118 predefined CJK CMaps, and §9.7.4.2 (pdf.md:18676-81) which explicitly authorizes implementation-dependent glyph selection for a non-embedded predefined CIDFont — the substitution mechanism's spec basis.
One note for the record (no change needed): per pdf.md:19042-43 Identity-H "shall not be used with a non-embedded font", so the Ryumin-Light-Identity-H input is technically non-conforming — the PR is correctly being a lenient reader for these ubiquitous real-world files, which §9.7.4.2 permits. The bundled fallback stays opt-in behind cjk-render-fallback.
The lone red Java Lint check is a Maven Central 403 flake (PR touches no Java) — I'll re-run it.
…substitution The cjk-render-fallback substitution path resolved a CIDFont's CID to a Unicode point only via the Adobe character-collection table (CharacterCollection::cid_to_unicode). That is correct for a real predefined CIDFont (Ryumin-Light, …) whose CIDs ARE the collection's CIDs, but wrong for an Identity-encoded subset whose arbitrary CIDs only resolve through the document's /ToUnicode CMap — there CID 1 is not collection CID 1, so the substitution painted the wrong glyph (often a blank space) and the page came out empty. Resolve CID -> Unicode by /ToUnicode first (authoritative per ISO 32000-1 §9.10.2, filtering U+FFFD/FFFE/FFFF placeholders), then fall back to the Adobe collection table when the font ships no /ToUnicode (the common case for the predefined CIDFonts the substitution targets). Fixes the previously-never-run-in-CI render test surfaced by the new rendering tier (#711); does not regress the predefined-collection path (#730 tests pass).
Summary
Closes #727. When a Type 0 CIDFont references an Adobe predefined CIDFont name (Ryumin-Light, GothicBBB-Medium, STSong-Light, MHei-Medium, HYSMyeongJo-Medium, etc.) without embedding the actual glyph outlines, the page renderer can now substitute the bundled Droid Sans Fallback so the CJK content actually renders. Before this change, every glyph from such a font was silently dropped —
jo.pdfrendered fully blank,kampo.pdfrendered only the Latin numerals while the kanji body stayed empty.ISO 32000-2 §9.7.5.2 requires support for Adobe-CNS1-7, Adobe-GB1-5, Adobe-Japan1-7 and Adobe-KR-9 character collections — this PR brings the renderer in line with that requirement.
Approach
Reuses existing in-tree infrastructure:
src/fonts/assets/DroidSansFallbackFull.ttfwas already bundled (used by the form-flatten path).src/fonts/cid_mappings/adobe_japan1.rs,adobe_gb1.rs,adobe_cns1.rs,adobe_korea1.rsalready provide the CID → Unicode tables.src/fonts/form_fallback.rs::font_bytes(Fallback::Cjk)is the proven prior art for routing to the bundled CJK font in another path.Three new pieces:
src/fonts/predefined_cidfont.rs— name→character-collection registry covering ~30 known Adobe predefined CIDFont names across all four character collections, with subset-prefix (ABCDEF+) and CMap-suffix (-Identity-V,-Identity-H,-90ms-RKSJ-H, etc.) stripping. Longest-match preference protects the legacy single-letter-H/-Vsuffixes from mis-stripping (e.g.STSong-Light-GBK-EUC-Hcorrectly resolves toSTSong-Light).FontInfogets anOption<CharacterCollection>field populated when subtype is Type 0,embedded_font_data.is_none(), and the BaseFont matches the registry.TextRasterizer::render_substituted_cjk(feature-gated oncjk-render-fallback) — walks CIDs from the existingTextCharIter, maps CID → Unicode via the registry → glyph_id via Droid Sans Fallback'scmap→ paints through the existing tiny-skia outline path. Advance widths come from the PDF's own/W//DW. Vertical writing mode is honoured (advance along y,v_x/v_yorigin offsets applied). When the CID has no Unicode mapping OR Droid Sans has no glyph, paint is skipped but advance is preserved (so the layout doesn't collapse).The form-flatten path's existing behaviour is unchanged — the new feature flag (
cjk-render-fallback) is independent ofcjk-form-fonts. Both gates can be enabled together, alone, or neither. No new dependencies; the bundled font and the CID mapping tables were already in-tree.Verification
Tests render the new
tests/fixtures/vertical_cjk/jo.pdfandkampo.pdf(committed from the pdfminer.six MIT corpus — seetests/fixtures/vertical_cjk/ATTRIBUTION.md) and count non-white pixels in the pixmap as a coarse "is anything being drawn" check. Each test is sensitivity-verified by toggling the substitution path off and confirming the assertion fails.jo.pdf(vertical Japanese, Ryumin-Light-Identity-V only)kampo.pdf(mixed Ryumin / GothicBBB-Medium / HeiseiKakuGo-W5 + Latin subset)Plus 12 unit tests in
predefined_cidfont.rs::testscovering bare and suffixed names per collection, subset-prefix strip, suffix-strip boundary protection, and the CID → Unicode chain.Pre-push gates
cargo fmt --checkcargo clippy --features rendering,test-support,cjk-render-fallback --all-targets -- -D warningscargo check --lib --no-default-featurescargo test --features rendering,test-support,cjk-render-fallbackcargo test --features rendering,test-support(regression guard)RUSTDOCFLAGS="-D warnings" cargo doc --no-depsRUSTDOCFLAGS="-D warnings" cargo doc --no-deps --no-default-featuresNote: the rendering test tier still doesn't run in CI (per #711); the suite above is local on macOS.
Documented trade-offs
vert/vrt2OpenType features) are not provided. The glyph shapes themselves are the horizontal form.cjk-render-fallback. Documents that embed their own outlines never touch this code path.Why feature-gated
Binary-size sensitivity. The 3.4 MB Droid Sans Fallback only matters for consumers actually rendering CJK; embedded-pipeline and OCR-only consumers don't need it in their binary. The form-flatten path already follows the same opt-in pattern via
cjk-form-fonts.Test plan
jo.pdf(Miyazawa Kenji preface) renders non-blankkampo.pdf(Japan Official Gazette) renders kanji content (not just Latin numerals)cjk-render-fallbackdisabled-D warningsfor both default and--no-default-features