fix(fonts): include /ToUnicode content in the cross-document font-cache key by regularkevvv · Pull Request #733 · yfedoseev/pdf_oxide

regularkevvv · 2026-06-10T23:34:27Z

Description

The process-global font cache (fonts::global_cache) deduplicates parsed fonts across PdfDocument instances, keyed by a font identity hash. Building on the cross-document hardening in #595 / #597 / #598 — which folds the /ToUnicode reference into the key and keeps canonical AAAAAA+ subset fonts out of the shared cache — this extends the key to two cases those checks don't yet reach:

a non-canonical subset tag such as /CIDFont+F1 (emitted by some generators) isn't recognized as a subset, so it stays eligible for cross-document sharing; and
PDFs produced from a common template reuse the same /ToUnicode object number, so the reference-based key is identical for two genuinely different fonts.

When both hold, two fonts that merely share a /BaseFont name collide on one key; processed back-to-back in one long-lived process, a later document is served an earlier font's parsed FontInfo and its glyphs decode through the wrong /ToUnicode — a constant-offset cipher (SUMMARY → 6800$5<) or control/PUA characters. Each document extracts correctly in isolation, which is the signature of process-global cache state rather than a per-file parse issue. In a long-lived process (server, batch job) this silently produces plausible-looking wrong text.

Approach

font_identity_hash_with_descendants now folds the content of a font's document-specific streams into the key, not just their references:

the /ToUnicode stream bytes (the decisive discriminator),
the embedded font-program bytes (/FontFile, /FontFile2, /FontFile3) on simple fonts and on the descendant CIDFont, and
the descendant /Subtype (CIDFontType0 vs CIDFontType2).

Same-named-but-different fonts now get distinct keys regardless of subset-tag form or object reuse, while genuinely identical fonts still deduplicate across documents — so the cache keeps its purpose and there's no throughput change. Stream bytes are folded raw (still encoded): a sufficient discriminator that never causes a false dedup and avoids inflating large font programs on the cache-key path. font_identity_hash_cheap is unchanged and still seeds the hash, so prior identity coverage is preserved bit-for-bit.

Type of Change

Bug fix (non-breaking change which fixes an issue)

Related Issues

Fixes #732 · Complements #595, #597, #598 (the cross-document cache hardening this builds on)

Changes Made

src/document.rs — font_identity_hash_with_descendants folds the /ToUnicode bytes, the embedded /FontFile{,2,3} bytes, and the descendant /Subtype; adds small helpers fold_stream_bytes / fold_font_program / resolve_indirect_for_hash. font_identity_hash_cheap is unchanged (it still seeds the hash, so prior identity coverage is preserved bit-for-bit).
tests/test_font_cache_cross_document.rs — in-code regression tests (hand-built Type0/CIDFontType2 PDFs, no binary fixtures): (1) several same-named fonts with different /ToUnicode, loaded back-to-back in one process, all decode correctly; (2) a byte-identical font hits the global cache (no new entry) while a font with a different /ToUnicode gets its own entry — pinning that the dedup win is preserved.
CHANGELOG.md — [Unreleased] entry.

Testing

Added tests that prove the fix (they report several garbled fonts before the change, zero after).
All new and existing tests pass locally (cargo test --lib 5554; full cargo test green).
I have run cargo test --all-features (ran with default features; ml/ocr features need extra system deps locally — CI covers these).
I have run cargo clippy --all-targets -- -D warnings
I have run cargo fmt

Python Bindings (if applicable)

N/A — core Rust cache-key change, no binding API changes; the behaviour flows through to all bindings automatically.

Documentation

CHANGELOG.md [Unreleased]
Examples (not applicable)

Checklist

My code follows the project's coding guidelines (see CONTRIBUTING.md)
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
My changes generate no new warnings
I have checked my code and corrected any misspellings
The PR title follows conventional commits format

Additional Notes

The regression test builds its PDFs in memory from raw operators (per the no-binary-fixtures convention) — fully synthetic, no PII. A standalone, runnable PDF reproducer (two small PDFs + pip install "pdf_oxide==0.3.63"; python reproduce.py) is attached in a comment on #732 for convenience.

yfedoseev

Approve — thank you, this is a careful fix for a nasty silent-corruption bug. 🙏

Verified the regression test actually guards the bug (not just passes on the fix): reverting only src/document.rs to pre-fix main and re-running, both tests fail hard — distinct_tounicode_fonts_do_not_collide_across_documents reports 3 of 4 fonts garbled (CID-as-codepoint fallback shows the offset directly: cid_base=1000 → U+03E8 Ϩ, 2000 → U+07D0 ߐ, 40000 → U+9C40 鱀), and the dedup test fails its 'different ToUnicode must decode' assertion. With the fix, both pass. That's exactly the constant-offset-cipher signature described, reproduced.

Why it's correct: the cross-document key folded only the /ToUnicode reference, and is_subset_basefont only excludes canonical AAAAAA+ subsets — so a non-canonical /CIDFont+F1 tag plus template object-number reuse made two genuinely different fonts share a key (Layer 6). Folding the /ToUnicode bytes, the embedded /FontFile{,2,3} bytes, and the descendant /Subtype gives them distinct keys while preserving real dedup (pinned by the second test). font_identity_hash_cheap is untouched, and the hash is memoized per font_ref so the byte-folding is once-per-font-per-document.

pdf.md-aligned: subset tag = six uppercase + + (§9.6.4), FontFile2=TrueType / FontFile3=CFF (Table 126), CIDFontType0/2 distinction (§9.7.4), /ToUnicode as the char→Unicode authority (§9.10.3) — all check out.

Two non-blocking notes for a future pass (neither gates this): (1) folding /FontFile* bytes means computing the key loads the embedded program even on a global-cache hit — fine, but if it ever shows in profiles, a /Length+prefix fold would keep most of the safety cheaper; (2) /CIDToGIDMap (§9.7.4.3, stream form) is a real Type2 glyph-mapping discriminator not folded — contrived, rendering-only, and pre-existing, but folding it would make descendant identity spec-complete.

Thanks again for the thorough writeup and the synthetic, no-fixture repro.

yfedoseev · 2026-06-11T00:57:09Z

Keeping my approval — this fix is correct and complete for #732, and I verified the regression test fails hard pre-fix (3/4 garbled, the CID-as-codepoint offset shows in the output) and passes after. 🎯

One optional completion while you're here (non-blocking): the descendant key folds /FontFile* and /Subtype, but not /CIDToGIDMap. Per ISO 32000-1 §9.7.4.3 a stream-form CIDToGIDMap changes CID→glyph selection, so two embedded CIDFontType2 fonts identical in everything you fold but differing in a custom CIDToGIDMap could still alias — the same key-missing-a-discriminator class as #732/#598, just rendering-only and far rarer. It's pre-existing (not introduced here), so this is genuinely optional.

If you'd like to close it in this PR, it's one more fold in the descendant section, reusing your helper:

// Descendant /CIDToGIDMap: stream form remaps CID→glyph (§9.7.4.3); the
// /Identity name folds nothing, so the common path is unaffected.
if let Some(c2g) = dd.get("CIDToGIDMap") {
    21u8.hash(&mut hasher);
    self.fold_stream_bytes(c2g, &mut hasher);
}

Totally fine to land #732's fix now and do /CIDToGIDMap as a follow-up instead — your call. Either way, thanks for the thorough writeup and the synthetic repro.

…he key The process-global font cache keys parsed fonts by an identity hash that folds the /ToUnicode reference (object id/gen) but not its bytes, and keeps only canonical AAAAAA+ subset fonts out of cross-document sharing. A non-canonical subset tag such as /CIDFont+F1 therefore stays shareable, and PDFs emitted from a common template reuse the same /ToUnicode object number, so two genuinely different fonts that merely share a /BaseFont name collide on one key. Processed in one long-lived process, a later document is served an earlier font's parsed FontInfo and its glyphs decode through the wrong /ToUnicode (a constant-offset cipher such as "SUMMARY" -> "6800$5<", or control/PUA characters); each document extracts correctly in isolation. font_identity_hash_with_descendants now folds the content of a font's document-specific streams into the key: the /ToUnicode stream bytes, the embedded /FontFile{,2,3} bytes (simple fonts and the descendant CIDFont), the descendant /Subtype, and a stream-form descendant /CIDToGIDMap (ISO 32000-1 §9.7.4.3). The /Identity name -- and an absent entry, which defaults to Identity -- fold nothing, so the common path's key is unchanged. Same-named-but- different fonts get distinct keys regardless of subset-tag form or object reuse, while genuinely identical fonts still deduplicate across documents. Stream bytes are folded raw to avoid inflating large font programs on the cache-key path; font_identity_hash_cheap is unchanged. Adds in-memory regression tests (no binary fixtures): same-named fonts with different /ToUnicode decode correctly back-to-back in one process; an identical font hits the global cache while a font with a different /ToUnicode -- or a different stream-form /CIDToGIDMap -- gets its own entry. Extends the cross-document cache hardening in yfedoseev#595, yfedoseev#597, yfedoseev#598. Signed-off-by: Kevin Castro <hola@kev.pe>

regularkevvv · 2026-06-11T03:31:46Z

Thanks for the thorough review and for re-running the regression both ways 🙏

Done — folded the descendant /CIDToGIDMap into the identity key, reusing fold_stream_bytes as you suggested. One small refinement to the snippet: I guarded it with if !matches!(c2g, Object::Name(_)), so the /Identity name — and an absent entry, which defaults to Identity — fold nothing at all (not even the domain byte). That keeps an explicit /CIDToGIDMap /Identity deduping with an absent one (avoids a tiny new dedup split between the two) while still distinguishing the stream form:

// Descendant /CIDToGIDMap: the *stream* form remaps CID→glyph (§9.7.4.3),
// so two otherwise-identical embedded CIDFontType2 fonts with different maps
// select different glyphs and must not alias. The /Identity name — and an
// absent entry, which defaults to Identity — fold nothing.
if let Some(c2g) = dd.get("CIDToGIDMap") {
    if !matches!(c2g, Object::Name(_)) {
        21u8.hash(&mut hasher);
        self.fold_stream_bytes(c2g, &mut hasher);
    }
}

Added a cache-stats regression test (stream_cid_to_gid_map_distinguishes_otherwise_identical_fonts): two fonts identical in name//ToUnicode/metrics but differing only in a stream-form /CIDToGIDMap must get separate cache entries. Verified it fails pre-fold (they alias → one entry) and passes after.

Also rebased onto the latest main (da027a64) and squashed into the single commit. cargo fmt / clippy --all-targets -- -D warnings clean; lib + the three cross-document tests green.

Left your other note (folding /FontFile* via /Length+prefix if it ever shows in profiles) as a future option — happy to follow up if you'd like.

yfedoseev

Re-approved — thank you, this is exactly right, and the refinement is better than my snippet. 🙏

Guarding with if !matches!(c2g, Object::Name(_)) so both the /Identity name and an absent entry fold nothing (not even the domain byte) is the correct call: it keeps an explicit /CIDToGIDMap /Identity deduping with an absent one (Identity is the default, §9.7.4.3) rather than splitting them, while still distinguishing the stream form. The new stream_cid_to_gid_map_distinguishes_otherwise_identical_fonts test plus the build_type0_pdf(cid_base, cid_to_gid) parameterization exercise it directly. Descendant identity is now spec-complete for the cross-document key. CI will run the full suite; no further changes from me.

…ode-collision # Conflicts: # CHANGELOG.md

regularkevvv requested a review from yfedoseev as a code owner June 10, 2026 23:34

yfedoseev approved these changes Jun 11, 2026

View reviewed changes

regularkevvv force-pushed the fix/font-cache-tounicode-collision branch from 983f31a to 22a33f8 Compare June 11, 2026 03:31

yfedoseev approved these changes Jun 11, 2026

View reviewed changes

Merge remote-tracking branch 'origin/main' into fix/font-cache-tounic…

e284551

…ode-collision # Conflicts: # CHANGELOG.md

yfedoseev merged commit 443c195 into yfedoseev:main Jun 11, 2026
281 of 283 checks passed

yfedoseev mentioned this pull request Jun 12, 2026

release: v0.3.64 #735

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(fonts): include /ToUnicode content in the cross-document font-cache key#733

fix(fonts): include /ToUnicode content in the cross-document font-cache key#733
yfedoseev merged 2 commits into
yfedoseev:mainfrom
regularkevvv:fix/font-cache-tounicode-collision

regularkevvv commented Jun 10, 2026

Uh oh!

yfedoseev left a comment

Uh oh!

yfedoseev commented Jun 11, 2026

Uh oh!

regularkevvv commented Jun 11, 2026

Uh oh!

yfedoseev left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

regularkevvv commented Jun 10, 2026

Description

Approach

Type of Change

Related Issues

Changes Made

Testing

Python Bindings (if applicable)

Documentation

Checklist

Additional Notes

Uh oh!

yfedoseev left a comment

Choose a reason for hiding this comment

Uh oh!

yfedoseev commented Jun 11, 2026

Uh oh!

regularkevvv commented Jun 11, 2026

Uh oh!

yfedoseev left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants