fix(fonts): include /ToUnicode content in the cross-document font-cache key#733
Conversation
yfedoseev
left a comment
There was a problem hiding this comment.
Approve — thank you, this is a careful fix for a nasty silent-corruption bug. 🙏
Verified the regression test actually guards the bug (not just passes on the fix): reverting only src/document.rs to pre-fix main and re-running, both tests fail hard — distinct_tounicode_fonts_do_not_collide_across_documents reports 3 of 4 fonts garbled (CID-as-codepoint fallback shows the offset directly: cid_base=1000 → U+03E8 Ϩ, 2000 → U+07D0 ߐ, 40000 → U+9C40 鱀), and the dedup test fails its 'different ToUnicode must decode' assertion. With the fix, both pass. That's exactly the constant-offset-cipher signature described, reproduced.
Why it's correct: the cross-document key folded only the /ToUnicode reference, and is_subset_basefont only excludes canonical AAAAAA+ subsets — so a non-canonical /CIDFont+F1 tag plus template object-number reuse made two genuinely different fonts share a key (Layer 6). Folding the /ToUnicode bytes, the embedded /FontFile{,2,3} bytes, and the descendant /Subtype gives them distinct keys while preserving real dedup (pinned by the second test). font_identity_hash_cheap is untouched, and the hash is memoized per font_ref so the byte-folding is once-per-font-per-document.
pdf.md-aligned: subset tag = six uppercase + + (§9.6.4), FontFile2=TrueType / FontFile3=CFF (Table 126), CIDFontType0/2 distinction (§9.7.4), /ToUnicode as the char→Unicode authority (§9.10.3) — all check out.
Two non-blocking notes for a future pass (neither gates this): (1) folding /FontFile* bytes means computing the key loads the embedded program even on a global-cache hit — fine, but if it ever shows in profiles, a /Length+prefix fold would keep most of the safety cheaper; (2) /CIDToGIDMap (§9.7.4.3, stream form) is a real Type2 glyph-mapping discriminator not folded — contrived, rendering-only, and pre-existing, but folding it would make descendant identity spec-complete.
Thanks again for the thorough writeup and the synthetic, no-fixture repro.
|
Keeping my approval — this fix is correct and complete for #732, and I verified the regression test fails hard pre-fix (3/4 garbled, the CID-as-codepoint offset shows in the output) and passes after. 🎯 One optional completion while you're here (non-blocking): the descendant key folds If you'd like to close it in this PR, it's one more fold in the descendant section, reusing your helper: // Descendant /CIDToGIDMap: stream form remaps CID→glyph (§9.7.4.3); the
// /Identity name folds nothing, so the common path is unaffected.
if let Some(c2g) = dd.get("CIDToGIDMap") {
21u8.hash(&mut hasher);
self.fold_stream_bytes(c2g, &mut hasher);
}Totally fine to land #732's fix now and do |
…he key
The process-global font cache keys parsed fonts by an identity hash that folds
the /ToUnicode reference (object id/gen) but not its bytes, and keeps only
canonical AAAAAA+ subset fonts out of cross-document sharing. A non-canonical
subset tag such as /CIDFont+F1 therefore stays shareable, and PDFs emitted from
a common template reuse the same /ToUnicode object number, so two genuinely
different fonts that merely share a /BaseFont name collide on one key. Processed
in one long-lived process, a later document is served an earlier font's parsed
FontInfo and its glyphs decode through the wrong /ToUnicode (a constant-offset
cipher such as "SUMMARY" -> "6800$5<", or control/PUA characters); each document
extracts correctly in isolation.
font_identity_hash_with_descendants now folds the content of a font's
document-specific streams into the key: the /ToUnicode stream bytes, the
embedded /FontFile{,2,3} bytes (simple fonts and the descendant CIDFont), the
descendant /Subtype, and a stream-form descendant /CIDToGIDMap (ISO 32000-1
§9.7.4.3). The /Identity name -- and an absent entry, which defaults to
Identity -- fold nothing, so the common path's key is unchanged. Same-named-but-
different fonts get distinct keys regardless of subset-tag form or object reuse,
while genuinely identical fonts still deduplicate across documents. Stream bytes
are folded raw to avoid inflating large font programs on the cache-key path;
font_identity_hash_cheap is unchanged.
Adds in-memory regression tests (no binary fixtures): same-named fonts with
different /ToUnicode decode correctly back-to-back in one process; an identical
font hits the global cache while a font with a different /ToUnicode -- or a
different stream-form /CIDToGIDMap -- gets its own entry.
Extends the cross-document cache hardening in yfedoseev#595, yfedoseev#597, yfedoseev#598.
Signed-off-by: Kevin Castro <hola@kev.pe>
983f31a to
22a33f8
Compare
|
Thanks for the thorough review and for re-running the regression both ways 🙏 Done — folded the descendant // Descendant /CIDToGIDMap: the *stream* form remaps CID→glyph (§9.7.4.3),
// so two otherwise-identical embedded CIDFontType2 fonts with different maps
// select different glyphs and must not alias. The /Identity name — and an
// absent entry, which defaults to Identity — fold nothing.
if let Some(c2g) = dd.get("CIDToGIDMap") {
if !matches!(c2g, Object::Name(_)) {
21u8.hash(&mut hasher);
self.fold_stream_bytes(c2g, &mut hasher);
}
}Added a cache-stats regression test ( Also rebased onto the latest Left your other note (folding |
yfedoseev
left a comment
There was a problem hiding this comment.
Re-approved — thank you, this is exactly right, and the refinement is better than my snippet. 🙏
Guarding with if !matches!(c2g, Object::Name(_)) so both the /Identity name and an absent entry fold nothing (not even the domain byte) is the correct call: it keeps an explicit /CIDToGIDMap /Identity deduping with an absent one (Identity is the default, §9.7.4.3) rather than splitting them, while still distinguishing the stream form. The new stream_cid_to_gid_map_distinguishes_otherwise_identical_fonts test plus the build_type0_pdf(cid_base, cid_to_gid) parameterization exercise it directly. Descendant identity is now spec-complete for the cross-document key. CI will run the full suite; no further changes from me.
…ode-collision # Conflicts: # CHANGELOG.md
Description
The process-global font cache (
fonts::global_cache) deduplicates parsed fonts acrossPdfDocumentinstances, keyed by a font identity hash. Building on the cross-document hardening in #595 / #597 / #598 — which folds the/ToUnicodereference into the key and keeps canonicalAAAAAA+subset fonts out of the shared cache — this extends the key to two cases those checks don't yet reach:/CIDFont+F1(emitted by some generators) isn't recognized as a subset, so it stays eligible for cross-document sharing; and/ToUnicodeobject number, so the reference-based key is identical for two genuinely different fonts.When both hold, two fonts that merely share a
/BaseFontname collide on one key; processed back-to-back in one long-lived process, a later document is served an earlier font's parsedFontInfoand its glyphs decode through the wrong/ToUnicode— a constant-offset cipher (SUMMARY→6800$5<) or control/PUA characters. Each document extracts correctly in isolation, which is the signature of process-global cache state rather than a per-file parse issue. In a long-lived process (server, batch job) this silently produces plausible-looking wrong text.Approach
font_identity_hash_with_descendantsnow folds the content of a font's document-specific streams into the key, not just their references:/ToUnicodestream bytes (the decisive discriminator),/FontFile,/FontFile2,/FontFile3) on simple fonts and on the descendant CIDFont, and/Subtype(CIDFontType0 vs CIDFontType2).Same-named-but-different fonts now get distinct keys regardless of subset-tag form or object reuse, while genuinely identical fonts still deduplicate across documents — so the cache keeps its purpose and there's no throughput change. Stream bytes are folded raw (still encoded): a sufficient discriminator that never causes a false dedup and avoids inflating large font programs on the cache-key path.
font_identity_hash_cheapis unchanged and still seeds the hash, so prior identity coverage is preserved bit-for-bit.Type of Change
Related Issues
Fixes #732 · Complements #595, #597, #598 (the cross-document cache hardening this builds on)
Changes Made
src/document.rs—font_identity_hash_with_descendantsfolds the/ToUnicodebytes, the embedded/FontFile{,2,3}bytes, and the descendant/Subtype; adds small helpersfold_stream_bytes/fold_font_program/resolve_indirect_for_hash.font_identity_hash_cheapis unchanged (it still seeds the hash, so prior identity coverage is preserved bit-for-bit).tests/test_font_cache_cross_document.rs— in-code regression tests (hand-built Type0/CIDFontType2 PDFs, no binary fixtures): (1) several same-named fonts with different/ToUnicode, loaded back-to-back in one process, all decode correctly; (2) a byte-identical font hits the global cache (no new entry) while a font with a different/ToUnicodegets its own entry — pinning that the dedup win is preserved.CHANGELOG.md—[Unreleased]entry.Testing
cargo test --lib5554; fullcargo testgreen).cargo test --all-features(ran with default features;ml/ocrfeatures need extra system deps locally — CI covers these).cargo clippy --all-targets -- -D warningscargo fmtPython Bindings (if applicable)
N/A — core Rust cache-key change, no binding API changes; the behaviour flows through to all bindings automatically.
Documentation
[Unreleased]Checklist
Additional Notes
The regression test builds its PDFs in memory from raw operators (per the no-binary-fixtures convention) — fully synthetic, no PII. A standalone, runnable PDF reproducer (two small PDFs +
pip install "pdf_oxide==0.3.63"; python reproduce.py) is attached in a comment on #732 for convenience.