Skip to content

fix(fonts): include /ToUnicode content in the cross-document font-cache key#733

Merged
yfedoseev merged 2 commits into
yfedoseev:mainfrom
regularkevvv:fix/font-cache-tounicode-collision
Jun 11, 2026
Merged

fix(fonts): include /ToUnicode content in the cross-document font-cache key#733
yfedoseev merged 2 commits into
yfedoseev:mainfrom
regularkevvv:fix/font-cache-tounicode-collision

Conversation

@regularkevvv

Copy link
Copy Markdown
Contributor

Description

The process-global font cache (fonts::global_cache) deduplicates parsed fonts across PdfDocument instances, keyed by a font identity hash. Building on the cross-document hardening in #595 / #597 / #598 — which folds the /ToUnicode reference into the key and keeps canonical AAAAAA+ subset fonts out of the shared cache — this extends the key to two cases those checks don't yet reach:

  • a non-canonical subset tag such as /CIDFont+F1 (emitted by some generators) isn't recognized as a subset, so it stays eligible for cross-document sharing; and
  • PDFs produced from a common template reuse the same /ToUnicode object number, so the reference-based key is identical for two genuinely different fonts.

When both hold, two fonts that merely share a /BaseFont name collide on one key; processed back-to-back in one long-lived process, a later document is served an earlier font's parsed FontInfo and its glyphs decode through the wrong /ToUnicode — a constant-offset cipher (SUMMARY6800$5<) or control/PUA characters. Each document extracts correctly in isolation, which is the signature of process-global cache state rather than a per-file parse issue. In a long-lived process (server, batch job) this silently produces plausible-looking wrong text.

Approach

font_identity_hash_with_descendants now folds the content of a font's document-specific streams into the key, not just their references:

  • the /ToUnicode stream bytes (the decisive discriminator),
  • the embedded font-program bytes (/FontFile, /FontFile2, /FontFile3) on simple fonts and on the descendant CIDFont, and
  • the descendant /Subtype (CIDFontType0 vs CIDFontType2).

Same-named-but-different fonts now get distinct keys regardless of subset-tag form or object reuse, while genuinely identical fonts still deduplicate across documents — so the cache keeps its purpose and there's no throughput change. Stream bytes are folded raw (still encoded): a sufficient discriminator that never causes a false dedup and avoids inflating large font programs on the cache-key path. font_identity_hash_cheap is unchanged and still seeds the hash, so prior identity coverage is preserved bit-for-bit.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)

Related Issues

Fixes #732 · Complements #595, #597, #598 (the cross-document cache hardening this builds on)

Changes Made

  • src/document.rsfont_identity_hash_with_descendants folds the /ToUnicode bytes, the embedded /FontFile{,2,3} bytes, and the descendant /Subtype; adds small helpers fold_stream_bytes / fold_font_program / resolve_indirect_for_hash. font_identity_hash_cheap is unchanged (it still seeds the hash, so prior identity coverage is preserved bit-for-bit).
  • tests/test_font_cache_cross_document.rs — in-code regression tests (hand-built Type0/CIDFontType2 PDFs, no binary fixtures): (1) several same-named fonts with different /ToUnicode, loaded back-to-back in one process, all decode correctly; (2) a byte-identical font hits the global cache (no new entry) while a font with a different /ToUnicode gets its own entry — pinning that the dedup win is preserved.
  • CHANGELOG.md[Unreleased] entry.

Testing

  • Added tests that prove the fix (they report several garbled fonts before the change, zero after).
  • All new and existing tests pass locally (cargo test --lib 5554; full cargo test green).
  • I have run cargo test --all-features (ran with default features; ml/ocr features need extra system deps locally — CI covers these).
  • I have run cargo clippy --all-targets -- -D warnings
  • I have run cargo fmt

Python Bindings (if applicable)

N/A — core Rust cache-key change, no binding API changes; the behaviour flows through to all bindings automatically.

Documentation

  • CHANGELOG.md [Unreleased]
  • Examples (not applicable)

Checklist

  • My code follows the project's coding guidelines (see CONTRIBUTING.md)
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • I have checked my code and corrected any misspellings
  • The PR title follows conventional commits format

Additional Notes

The regression test builds its PDFs in memory from raw operators (per the no-binary-fixtures convention) — fully synthetic, no PII. A standalone, runnable PDF reproducer (two small PDFs + pip install "pdf_oxide==0.3.63"; python reproduce.py) is attached in a comment on #732 for convenience.

@regularkevvv regularkevvv requested a review from yfedoseev as a code owner June 10, 2026 23:34

@yfedoseev yfedoseev left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approve — thank you, this is a careful fix for a nasty silent-corruption bug. 🙏

Verified the regression test actually guards the bug (not just passes on the fix): reverting only src/document.rs to pre-fix main and re-running, both tests fail hard — distinct_tounicode_fonts_do_not_collide_across_documents reports 3 of 4 fonts garbled (CID-as-codepoint fallback shows the offset directly: cid_base=1000 → U+03E8 Ϩ, 2000 → U+07D0 ߐ, 40000 → U+9C40 ), and the dedup test fails its 'different ToUnicode must decode' assertion. With the fix, both pass. That's exactly the constant-offset-cipher signature described, reproduced.

Why it's correct: the cross-document key folded only the /ToUnicode reference, and is_subset_basefont only excludes canonical AAAAAA+ subsets — so a non-canonical /CIDFont+F1 tag plus template object-number reuse made two genuinely different fonts share a key (Layer 6). Folding the /ToUnicode bytes, the embedded /FontFile{,2,3} bytes, and the descendant /Subtype gives them distinct keys while preserving real dedup (pinned by the second test). font_identity_hash_cheap is untouched, and the hash is memoized per font_ref so the byte-folding is once-per-font-per-document.

pdf.md-aligned: subset tag = six uppercase + + (§9.6.4), FontFile2=TrueType / FontFile3=CFF (Table 126), CIDFontType0/2 distinction (§9.7.4), /ToUnicode as the char→Unicode authority (§9.10.3) — all check out.

Two non-blocking notes for a future pass (neither gates this): (1) folding /FontFile* bytes means computing the key loads the embedded program even on a global-cache hit — fine, but if it ever shows in profiles, a /Length+prefix fold would keep most of the safety cheaper; (2) /CIDToGIDMap (§9.7.4.3, stream form) is a real Type2 glyph-mapping discriminator not folded — contrived, rendering-only, and pre-existing, but folding it would make descendant identity spec-complete.

Thanks again for the thorough writeup and the synthetic, no-fixture repro.

@yfedoseev

Copy link
Copy Markdown
Owner

Keeping my approval — this fix is correct and complete for #732, and I verified the regression test fails hard pre-fix (3/4 garbled, the CID-as-codepoint offset shows in the output) and passes after. 🎯

One optional completion while you're here (non-blocking): the descendant key folds /FontFile* and /Subtype, but not /CIDToGIDMap. Per ISO 32000-1 §9.7.4.3 a stream-form CIDToGIDMap changes CID→glyph selection, so two embedded CIDFontType2 fonts identical in everything you fold but differing in a custom CIDToGIDMap could still alias — the same key-missing-a-discriminator class as #732/#598, just rendering-only and far rarer. It's pre-existing (not introduced here), so this is genuinely optional.

If you'd like to close it in this PR, it's one more fold in the descendant section, reusing your helper:

// Descendant /CIDToGIDMap: stream form remaps CID→glyph (§9.7.4.3); the
// /Identity name folds nothing, so the common path is unaffected.
if let Some(c2g) = dd.get("CIDToGIDMap") {
    21u8.hash(&mut hasher);
    self.fold_stream_bytes(c2g, &mut hasher);
}

Totally fine to land #732's fix now and do /CIDToGIDMap as a follow-up instead — your call. Either way, thanks for the thorough writeup and the synthetic repro.

…he key

The process-global font cache keys parsed fonts by an identity hash that folds
the /ToUnicode reference (object id/gen) but not its bytes, and keeps only
canonical AAAAAA+ subset fonts out of cross-document sharing. A non-canonical
subset tag such as /CIDFont+F1 therefore stays shareable, and PDFs emitted from
a common template reuse the same /ToUnicode object number, so two genuinely
different fonts that merely share a /BaseFont name collide on one key. Processed
in one long-lived process, a later document is served an earlier font's parsed
FontInfo and its glyphs decode through the wrong /ToUnicode (a constant-offset
cipher such as "SUMMARY" -> "6800$5<", or control/PUA characters); each document
extracts correctly in isolation.

font_identity_hash_with_descendants now folds the content of a font's
document-specific streams into the key: the /ToUnicode stream bytes, the
embedded /FontFile{,2,3} bytes (simple fonts and the descendant CIDFont), the
descendant /Subtype, and a stream-form descendant /CIDToGIDMap (ISO 32000-1
§9.7.4.3). The /Identity name -- and an absent entry, which defaults to
Identity -- fold nothing, so the common path's key is unchanged. Same-named-but-
different fonts get distinct keys regardless of subset-tag form or object reuse,
while genuinely identical fonts still deduplicate across documents. Stream bytes
are folded raw to avoid inflating large font programs on the cache-key path;
font_identity_hash_cheap is unchanged.

Adds in-memory regression tests (no binary fixtures): same-named fonts with
different /ToUnicode decode correctly back-to-back in one process; an identical
font hits the global cache while a font with a different /ToUnicode -- or a
different stream-form /CIDToGIDMap -- gets its own entry.

Extends the cross-document cache hardening in yfedoseev#595, yfedoseev#597, yfedoseev#598.

Signed-off-by: Kevin Castro <hola@kev.pe>
@regularkevvv regularkevvv force-pushed the fix/font-cache-tounicode-collision branch from 983f31a to 22a33f8 Compare June 11, 2026 03:31
@regularkevvv

Copy link
Copy Markdown
Contributor Author

Thanks for the thorough review and for re-running the regression both ways 🙏

Done — folded the descendant /CIDToGIDMap into the identity key, reusing fold_stream_bytes as you suggested. One small refinement to the snippet: I guarded it with if !matches!(c2g, Object::Name(_)), so the /Identity name — and an absent entry, which defaults to Identity — fold nothing at all (not even the domain byte). That keeps an explicit /CIDToGIDMap /Identity deduping with an absent one (avoids a tiny new dedup split between the two) while still distinguishing the stream form:

// Descendant /CIDToGIDMap: the *stream* form remaps CID→glyph (§9.7.4.3),
// so two otherwise-identical embedded CIDFontType2 fonts with different maps
// select different glyphs and must not alias. The /Identity name — and an
// absent entry, which defaults to Identity — fold nothing.
if let Some(c2g) = dd.get("CIDToGIDMap") {
    if !matches!(c2g, Object::Name(_)) {
        21u8.hash(&mut hasher);
        self.fold_stream_bytes(c2g, &mut hasher);
    }
}

Added a cache-stats regression test (stream_cid_to_gid_map_distinguishes_otherwise_identical_fonts): two fonts identical in name//ToUnicode/metrics but differing only in a stream-form /CIDToGIDMap must get separate cache entries. Verified it fails pre-fold (they alias → one entry) and passes after.

Also rebased onto the latest main (da027a64) and squashed into the single commit. cargo fmt / clippy --all-targets -- -D warnings clean; lib + the three cross-document tests green.

Left your other note (folding /FontFile* via /Length+prefix if it ever shows in profiles) as a future option — happy to follow up if you'd like.

@yfedoseev yfedoseev left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-approved — thank you, this is exactly right, and the refinement is better than my snippet. 🙏

Guarding with if !matches!(c2g, Object::Name(_)) so both the /Identity name and an absent entry fold nothing (not even the domain byte) is the correct call: it keeps an explicit /CIDToGIDMap /Identity deduping with an absent one (Identity is the default, §9.7.4.3) rather than splitting them, while still distinguishing the stream form. The new stream_cid_to_gid_map_distinguishes_otherwise_identical_fonts test plus the build_type0_pdf(cid_base, cid_to_gid) parameterization exercise it directly. Descendant identity is now spec-complete for the cross-document key. CI will run the full suite; no further changes from me.

@yfedoseev yfedoseev merged commit 443c195 into yfedoseev:main Jun 11, 2026
281 of 283 checks passed
@yfedoseev yfedoseev mentioned this pull request Jun 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Type0/Identity-H text garbles across documents sharing a non-canonical subset font name

2 participants