Skip to content

feat(render): substitute Adobe predefined CIDFonts via bundled CJK fallback#730

Merged
yfedoseev merged 9 commits into
yfedoseev:mainfrom
RayVR:feature/cjk-predefined-cidfont-fallback
Jun 11, 2026
Merged

feat(render): substitute Adobe predefined CIDFonts via bundled CJK fallback#730
yfedoseev merged 9 commits into
yfedoseev:mainfrom
RayVR:feature/cjk-predefined-cidfont-fallback

Conversation

@RayVR

@RayVR RayVR commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Summary

Closes #727. When a Type 0 CIDFont references an Adobe predefined CIDFont name (Ryumin-Light, GothicBBB-Medium, STSong-Light, MHei-Medium, HYSMyeongJo-Medium, etc.) without embedding the actual glyph outlines, the page renderer can now substitute the bundled Droid Sans Fallback so the CJK content actually renders. Before this change, every glyph from such a font was silently dropped — jo.pdf rendered fully blank, kampo.pdf rendered only the Latin numerals while the kanji body stayed empty.

ISO 32000-2 §9.7.5.2 requires support for Adobe-CNS1-7, Adobe-GB1-5, Adobe-Japan1-7 and Adobe-KR-9 character collections — this PR brings the renderer in line with that requirement.

Approach

Reuses existing in-tree infrastructure:

  • src/fonts/assets/DroidSansFallbackFull.ttf was already bundled (used by the form-flatten path).
  • src/fonts/cid_mappings/adobe_japan1.rs, adobe_gb1.rs, adobe_cns1.rs, adobe_korea1.rs already provide the CID → Unicode tables.
  • src/fonts/form_fallback.rs::font_bytes(Fallback::Cjk) is the proven prior art for routing to the bundled CJK font in another path.

Three new pieces:

  1. src/fonts/predefined_cidfont.rs — name→character-collection registry covering ~30 known Adobe predefined CIDFont names across all four character collections, with subset-prefix (ABCDEF+) and CMap-suffix (-Identity-V, -Identity-H, -90ms-RKSJ-H, etc.) stripping. Longest-match preference protects the legacy single-letter -H / -V suffixes from mis-stripping (e.g. STSong-Light-GBK-EUC-H correctly resolves to STSong-Light).
  2. Detection at font-load timeFontInfo gets an Option<CharacterCollection> field populated when subtype is Type 0, embedded_font_data.is_none(), and the BaseFont matches the registry.
  3. TextRasterizer::render_substituted_cjk (feature-gated on cjk-render-fallback) — walks CIDs from the existing TextCharIter, maps CID → Unicode via the registry → glyph_id via Droid Sans Fallback's cmap → paints through the existing tiny-skia outline path. Advance widths come from the PDF's own /W / /DW. Vertical writing mode is honoured (advance along y, v_x/v_y origin offsets applied). When the CID has no Unicode mapping OR Droid Sans has no glyph, paint is skipped but advance is preserved (so the layout doesn't collapse).

The form-flatten path's existing behaviour is unchanged — the new feature flag (cjk-render-fallback) is independent of cjk-form-fonts. Both gates can be enabled together, alone, or neither. No new dependencies; the bundled font and the CID mapping tables were already in-tree.

Verification

Tests render the new tests/fixtures/vertical_cjk/jo.pdf and kampo.pdf (committed from the pdfminer.six MIT corpus — see tests/fixtures/vertical_cjk/ATTRIBUTION.md) and count non-white pixels in the pixmap as a coarse "is anything being drawn" check. Each test is sensitivity-verified by toggling the substitution path off and confirming the assertion fails.

Test Pre-fix non-white px Post-fix Δ
Synthetic Adobe-Japan1 PDF 0 7,340 +7,340
jo.pdf (vertical Japanese, Ryumin-Light-Identity-V only) 0 119,186 +119,186
kampo.pdf (mixed Ryumin / GothicBBB-Medium / HeiseiKakuGo-W5 + Latin subset) 21,212 (Latin only) 277,397 +256,185

Plus 12 unit tests in predefined_cidfont.rs::tests covering bare and suffixed names per collection, subset-prefix strip, suffix-strip boundary protection, and the CID → Unicode chain.

Pre-push gates

Gate Result
cargo fmt --check green
cargo clippy --features rendering,test-support,cjk-render-fallback --all-targets -- -D warnings green
cargo check --lib --no-default-features green
cargo test --features rendering,test-support,cjk-render-fallback 288 test binaries, 0 failures
cargo test --features rendering,test-support (regression guard) 0 failures
RUSTDOCFLAGS="-D warnings" cargo doc --no-deps green
RUSTDOCFLAGS="-D warnings" cargo doc --no-deps --no-default-features green

Note: the rendering test tier still doesn't run in CI (per #711); the suite above is local on macOS.

Documented trade-offs

  • Style fidelity: Droid Sans Fallback is a sans-serif face. When a Mincho/Song serif typeface like Ryumin-Light is requested, the rendered output is sans-serif. The module header documents this explicitly. The alternative (continue dropping every glyph) is worse for every reasonable consumer.
  • Vertical-form glyph variants: Droid Sans Fallback is a horizontal font. Layout advances vertically via the existing WMode-1 routing, but vertical-form glyph variants (rotated punctuation, vertical-baseline-shifted parens via vert / vrt2 OpenType features) are not provided. The glyph shapes themselves are the horizontal form.
  • No system-font integration: deliberate. Keeping the binary self-contained means no CoreText / DirectWrite / fontconfig dependency and deterministic output regardless of host. Trade-off: ~3.4 MB binary size when the feature is enabled.
  • Existing rendering paths unchanged when the feature is off: the substitution check is a no-op without cjk-render-fallback. Documents that embed their own outlines never touch this code path.

Why feature-gated

Binary-size sensitivity. The 3.4 MB Droid Sans Fallback only matters for consumers actually rendering CJK; embedded-pipeline and OCR-only consumers don't need it in their binary. The form-flatten path already follows the same opt-in pattern via cjk-form-fonts.

Test plan

  • Unit tests for the predefined-CIDFont name registry (12 tests, all positive and negative cases)
  • Synthetic minimal Adobe-Japan1 PDF renders non-blank
  • Real-world jo.pdf (Miyazawa Kenji preface) renders non-blank
  • Real-world kampo.pdf (Japan Official Gazette) renders kanji content (not just Latin numerals)
  • All three integration tests sensitivity-verified by toggling the substitution off
  • Regression guard: full test suite passes with cjk-render-fallback disabled
  • Doc build clean under -D warnings for both default and --no-default-features

RayVR added 6 commits June 10, 2026 15:36
Adds src/fonts/predefined_cidfont.rs: a curated name->character-collection
registry covering the Adobe-Japan1, Adobe-GB1, Adobe-CNS1, and Adobe-Korea1
predefined CIDFonts that ISO 32000-2 §9.7.5.2 mandates conforming readers to
support (Ryumin-Light, GothicBBB-Medium, HeiseiMin-W3, HeiseiKakuGo-W5,
STSong-Light, STHeiti-Regular, SimSun, SimHei, MHei-Medium, MSung-Light,
MingLiU, HYSMyeongJo-Medium, HYGoThic-Medium, Adobe-MyungjoStd-Medium, …).

Producers concatenate Adobe predefined CMap names onto the base font name to
form Type0 references like Ryumin-Light-Identity-V or STSong-Light-GBK-EUC-H.
The matcher strips the longest known CMap suffix (so STSong-Light-GBK-EUC-H
does not mis-strip the legacy single-letter -H suffix and leave an
unresolvable STSong-Light-GBK-EUC) and a 6-character subset prefix before
matching the bare base name.

CharacterCollection::cid_to_unicode routes through the existing
cid_mappings tables; this is the input the page-render substitution path
will feed into Droid Sans Fallback's Unicode cmap.

Also introduces the cjk-render-fallback cargo feature that subsequent
commits will gate the bundled-font lookup behind. No behaviour change yet.
12 unit tests, all paths covered: bare names, suffixed names, subset
prefixes, unrelated fonts, suffix-strip boundary protection.
…d time

Adds FontInfo.cjk_substitution: when from_dict() finishes parsing a Type0
font whose CIDFont descriptor has no /FontFile{,2,3} AND whose BaseFont
matches the Adobe predefined registry (post subset-prefix and CMap-suffix
strip), the field carries the resolved CharacterCollection. Every other
font keeps cjk_substitution: None and the existing decode/render paths
run unchanged.

The detection is pure metadata routing — no render-time behaviour yet.
The renderer consults this flag in a follow-up commit to route glyph
paint through the bundled covering font.

Existing FontInfo construction sites in unit and integration tests get
cjk_substitution: None appended; the field is intentionally non-optional
in the struct literal (no Default impl) so future renderers, extractors,
and writers that materialise FontInfo by hand cannot accidentally omit
the substitution flag and silently regress.
Adds TextRasterizer::render_substituted_cjk, gated on cjk-render-fallback.
When render_text observes a FontInfo flagged for CJK substitution at load
time (Type0 + no embedded outlines + base name in the Adobe predefined
registry), the paint routes through this new method:

  CID -> Unicode (via the matching cid_mappings table for Adobe-Japan1 /
  Adobe-GB1 / Adobe-CNS1 / Adobe-Korea1) -> Droid Sans Fallback glyph_id
  (via the bundled font's Unicode cmap) -> outline paint through the
  existing tiny-skia path.

Advance widths come from the PDF's own /W array so the substituted layout
matches the source. Vertical writing mode (text_wmode == 1) advances along
y with the same v_x / v_y origin offsets the embedded-CID path uses; the
glyph shape itself is the horizontal form of Droid Sans Fallback (vertical
variant glyphs would require a separate face we don't ship).

When a CID has no Unicode mapping under the resolved collection, or Droid
Sans Fallback has no glyph for the resolved Unicode point, the paint is
skipped but the cursor still advances so subsequent glyphs land
correctly — never a silent character drop without preserving layout.

src/fonts/form_fallback.rs gains a feature-gated render_cjk_fallback_bytes
accessor so the render path can include_bytes! the same DroidSans asset
without coupling to the cjk-form-fonts feature. The two features are
intentionally independent: enabling one does not activate the other, so
the form-flatten path's behaviour is unchanged when only cjk-form-fonts
is on.

When cjk-render-fallback is off (the default), all the substitution code
compiles out and render_text falls through to the existing routing -
documents with predefined CIDFonts still render blank for those fonts (no
regression vs. main, but no improvement either).
…xtures

Adds tests/test_cjk_predefined_cidfont_render.rs (gated on rendering +
cjk-render-fallback) plus two real-world fixtures (jo.pdf, kampo.pdf —
both pdfminer.six samples, MIT-licensed) and a synthetic Adobe-Japan1
PDF assembled inline.

Four tests:

  1. font_info_flags_predefined_cidfont_for_substitution — synthetic PDF
     parses without error, proving the new cjk_substitution field on
     FontInfo doesn't break the existing construction paths.

  2. synthetic_adobe_japan1_pdf_renders_non_blank — synthetic minimal PDF
     with Ryumin-Light Identity-H and two CJK CIDs renders >= 1000
     non-white pixels (sensitivity-verified at 7,340 with substitution
     vs. 0 without — a clean detection of regression).

  3. jo_pdf_vertical_japanese_renders_non_blank — pdfminer.six's classic
     vertical-Japanese fixture (Ryumin-Light Identity-V) renders >= 2000
     non-white pixels (sensitivity-verified at 119,186 vs. 0 without).

  4. kampo_pdf_japanese_pharmacopeia_renders_non_blank — denser real-
     world Japanese pharmacopeia text with Ryumin-Light + GothicBBB-Medium
     + HeiseiKakuGo-W5 references plus an embedded Latin Type1C subset.
     Renders >= 5000 non-white pixels (sensitivity-verified at 277,397
     with substitution vs. 21,212 without — the 256k delta is the CJK
     content that was previously blank).

tests/fixtures/vertical_cjk/ATTRIBUTION.md documents fixture provenance
(pdfminer.six MIT corpus) and the Droid Sans Fallback Apache-2.0
license that ships in src/fonts/assets/.
rustdoc with -D warnings rejected the previous module-level doc comments:

  - The '[`is_predefined`]' link in the inner //! docstring was resolved
    against the parent fonts/mod.rs scope (since rustdoc walks pub mod
    docstrings under the parent), where the function doesn't exist. Switch
    to plain prose since the function is right below in the same file.

  - The '[`super::cid_mappings`]' links failed the same way: rustdoc's
    intra-doc resolver doesn't accept super:: in module-level //! comments
    from another module's vantage point. Use crate::fonts::cid_mappings
    instead — explicit, unambiguous, identical resolved target.
@RayVR RayVR requested a review from yfedoseev as a code owner June 10, 2026 11:19

@yfedoseev yfedoseev left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: request changes

Great work — the intent is genuinely spec-motivated (ISO 32000-2 §9.7.5.2 requires Adobe-Japan1/GB1/CNS1/KR support), the BaseFont subset-prefix stripping matches §9.6.4, and substituting a covering font for a non-embedded CIDFont is a legitimate conforming-reader strategy. Registry longest-match / -H/-V boundary protection, feature-gating (clean no-op when the flag is off), and unwrap-safety all check out. Two blockers before merge (see inline comments), plus two lower-priority notes below.

Lower-priority (non-blocking)

Collection chosen from font name, not /CIDSystemInfo (font_dict.rs:1204). The character collection — i.e. which CID→Unicode table is used — is derived purely from the base-font name. The descendant font's /CIDSystemInfo Registry/Ordering is parsed and available in the same function but never consulted. A producer that names a font STSong-Light (→ Adobe-GB1) but pairs it with an Identity-ordered descendant would be interpreted against the wrong table with no diagnostic. Suggest cross-checking cid_system_info.ordering and preferring it (or declining substitution) on mismatch.

Returned advance omits horizontal scaling (Tz/Th) (text_rasterizer.rs, success path vs the bundled-face-missing branch). The success path accumulates x_cursor without h_scale, but the load-failure fallback routes through measure_text_bytes, which does apply h_scale. Per §9.4.4, tx = ((w0·Tfs)+Tc+Tw)·Th, so Th belongs in the returned text-space advance — and the two branches of the same function should agree. (render_cid_direct shares the no-h_scale convention, so this is partly pre-existing, but the new function disagrees with itself.)

Nice touch on the sensitivity-verified tests — toggling the path off and confirming the assertion fails is exactly right. The main gap is that only Identity-H is exercised (see blocker #2).

Comment thread src/fonts/font_dict.rs Outdated
Comment thread src/rendering/text_rasterizer.rs
RayVR added 2 commits June 11, 2026 18:17
Three gate fixes for the predefined-CIDFont substitution detection in
FontInfo::from_dict:

- Gate on /FontFile{,2,3} key presence instead of the post-extraction
  Option. A descriptor that declares a font program whose stream fails
  to load or decode is a decode defect, not a substitution candidate;
  substituting would mask the failure and silently degrade output. The
  previously swallowed load/decode errors on the Type0 wrapper's
  descriptor path are now logged, and a warning names the font when a
  declared program could not be extracted.

- Require the /Encoding to resolve to an Identity charcode-to-CID
  mapping (Identity-H/V or an Adobe-collection identity CMap stream).
  Non-Identity predefined CMaps (90ms-RKSJ-H, GBK-EUC-H, B5pc-H, ...)
  put raw Shift-JIS / EUC / Big5 codes in the content stream; indexing
  the CID-to-Unicode tables with those values paints wrong glyphs with
  no diagnostic. Such fonts are recognised by the name registry but
  left unsubstituted until a charcode-to-CID CMap pass is wired.

- Derive the character collection from the descendant's /CIDSystemInfo
  Ordering when it names a known collection, falling back to the
  name-derived collection for Identity or unknown orderings. The
  explicit CIDSystemInfo is authoritative for CID semantics per
  ISO 32000-1 §9.7.3; a producer that names a font STSong-Light but
  pairs it with a Japan1-ordered descendant now gets the Japan1 table.

parse_descendant_fonts and extract_embedded_font_from_descriptor now
report descriptor font-program key presence alongside the extracted
bytes so the gate can tell "no program" apart from "program present
but failed to decode".
… paint

Per ISO 32000-1 §9.4.4 the horizontal displacement is
tx = ((w0*Tfs)+Tc+Tw)*Th. render_substituted_cjk deferred Th to the
per-glyph paint position but returned the unscaled cursor, so the text
matrix advanced as if Tz were 100% while the painted glyphs were
scaled - and the same function's bundled-face-missing fallback
(measure_text_bytes) does apply Th, so the two branches disagreed.
Scale the returned horizontal advance by Th; vertical advances keep
no Th factor per the spec.

Also document the code == CID invariant at the substitution paint
loop: the load-time gate only flags fonts whose /Encoding resolved to
Encoding::Identity, so the raw charcode is the CID by construction.
@RayVR

RayVR commented Jun 11, 2026

Copy link
Copy Markdown
Contributor Author

@yfedoseev all four findings addressed in 8d732de0..09396218:

Sev Finding Fix Sensitivity-verified
🔴 Substitution masks failed embedded-font decode Gate on /FontFile{,2,3} key presence across both descriptors; swallowed .ok() errors now logged; warn + decline when a declared program fails to extract (8d732de0) unloadable-/FontFile3 test flipped Some→None
🔴 Raw charcode used as CID for non-Identity CMaps Substitution restricted to Encoding::Identity; non-Identity predefined CMaps recognised but skipped with a diagnostic until a charcode→CID CMap pass is wired — /W widths are CID-keyed too, so partial wiring can't be correct (8d732de0, detail in the inline thread) 90ms-RKSJ-H test flipped Some→None
🟡 Collection from font name, not /CIDSystemInfo Known /CIDSystemInfo Ordering now overrides the name-derived collection (logged on mismatch); Identity/unknown orderings keep the name fallback (8d732de0) Ryumin-Light + GB1 ordering test flipped AdobeJapan1→AdobeGB1
🟡 Returned advance omits Th Horizontal advance returned by render_substituted_cjk now scales by Th per §9.4.4, matching its own measure_text_bytes fallback; vertical stays Th-free (09396218) Tz=50% test: advance was 20.0 both ways pre-fix, now 10.0

Full suite (8547 tests, rendering,cjk-render-fallback), default-feature lib tests, clippy -D warnings, --no-default-features lib check, and rustdoc -D warnings are all green locally. The fixture renders still pass with margin: jo.pdf 119k and kampo.pdf 277k non-white pixels against the 2k/5k gates.

@RayVR

RayVR commented Jun 11, 2026

Copy link
Copy Markdown
Contributor Author

CI note: the single red check, Java Lint (Spotless + SpotBugs), failed in its first step — Maven Central returned 403 Forbidden while fetching the Spotless plugin POM itself, before any lint ran (job log). This PR touches no Java; a re-run of the failed job should clear it (I can't trigger re-runs from a fork). The other 178 checks are green.

@yfedoseev yfedoseev left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approve — thanks @RayVR, all four findings are genuinely fixed (verified in code, not just description; tests run green locally). 🙏

Blocker 1 (masked embedded-decode failure) — fixed. Substitution now gates on has_font_program derived from /FontFile{,2,3} key presence across both the wrapper and descendant descriptors (font_dict.rs:1294-1297, merged at :1127), and warns + declines when a declared program key is present but extraction fails (:1338-1344); the previously-swallowed .ok() errors are now surfaced. The ...key_present_but_unextractable test (dangling /FontFile3) pins Some→None.

Blocker 2 (raw char-code as CID for non-Identity CMaps) — fixed. Substitution restricted to Encoding::Identity (font_dict.rs:1297); parse_encoding maps only Identity-H/V there, so 90ms-RKSJ-H/GBK-EUC-H/B5pc-H fall to Standard and are recognized-but-skipped with a diagnostic (:1345-1351). The requires_identity_cmap_encoding test pins Ryumin-Light-90ms-RKSJ-H → None. The safe choice (decline until a charcode→CID pass is wired) is right — partial wiring would mis-key the CID-keyed /W widths too.

Non-blocking (3)(4) — addressed. /CIDSystemInfo Ordering now overrides the name-derived collection on mismatch (logged); horizontal advance scales by Th per §9.4.4 while vertical stays Th-free, matching the measure_text_bytes branch.

pdf.md-aligned: §9.4.4 tx=…×Th vs ty (no Th) (pdf.md:17442-48), §9.6.4 six-uppercase subset prefix (17852), §9.7.5.2 + Table 118 predefined CJK CMaps, and §9.7.4.2 (pdf.md:18676-81) which explicitly authorizes implementation-dependent glyph selection for a non-embedded predefined CIDFont — the substitution mechanism's spec basis.

One note for the record (no change needed): per pdf.md:19042-43 Identity-H "shall not be used with a non-embedded font", so the Ryumin-Light-Identity-H input is technically non-conforming — the PR is correctly being a lenient reader for these ubiquitous real-world files, which §9.7.4.2 permits. The bundled fallback stays opt-in behind cjk-render-fallback.

The lone red Java Lint check is a Maven Central 403 flake (PR touches no Java) — I'll re-run it.

@yfedoseev yfedoseev merged commit e12609e into yfedoseev:main Jun 11, 2026
188 checks passed
@yfedoseev yfedoseev mentioned this pull request Jun 12, 2026
yfedoseev added a commit that referenced this pull request Jun 12, 2026
…substitution

The cjk-render-fallback substitution path resolved a CIDFont's CID to a
Unicode point only via the Adobe character-collection table
(CharacterCollection::cid_to_unicode). That is correct for a real
predefined CIDFont (Ryumin-Light, …) whose CIDs ARE the collection's CIDs,
but wrong for an Identity-encoded subset whose arbitrary CIDs only resolve
through the document's /ToUnicode CMap — there CID 1 is not collection CID
1, so the substitution painted the wrong glyph (often a blank space) and
the page came out empty.

Resolve CID -> Unicode by /ToUnicode first (authoritative per ISO 32000-1
§9.10.2, filtering U+FFFD/FFFE/FFFF placeholders), then fall back to the
Adobe collection table when the font ships no /ToUnicode (the common case
for the predefined CIDFonts the substitution targets). Fixes the
previously-never-run-in-CI render test surfaced by the new rendering tier
(#711); does not regress the predefined-collection path (#730 tests pass).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Rendering: predefined Adobe CIDFonts (Ryumin-Light / GothicBBB-Medium / STSong-Light / etc.) render blank — ISO 32000-2 §9.7.5.2 substitution missing

2 participants