diff --git a/CHANGELOG.md b/CHANGELOG.md
index 1e63a220..e9a6fa5f 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -6,6 +6,7 @@ All notable changes to PDFOxide are documented here.

 ### Fixed

+- **Cross-document font cache now keys Type0/Identity-H fonts by `/ToUnicode` content (extends #595, #597, #598)** — the cross-document cache hardening in #595/#597/#598 folds the `/ToUnicode` *reference* (object id/gen) into the font identity hash and keeps *canonical* subset fonts (`AAAAAA+`) out of the shared cache. This extends that coverage to two cases the reference-based key doesn't reach: a non-canonical subset tag such as `/CIDFont+F1` (emitted by some generators) stays eligible for cross-document sharing, and PDFs produced from a common template reuse the same `/ToUnicode` object number — so two genuinely different fonts that merely share a `/BaseFont` name produced an identical key. Processed in one long-lived process, a later document was then served an earlier font's parsed `FontInfo` and its glyphs decoded through the wrong `/ToUnicode` — a constant-offset cipher (`SUMMARY` → `6800$5<`) or control/PUA characters — though each document extracted correctly in isolation. The identity hash now folds the `/ToUnicode` stream *bytes*, the embedded `/FontFile{,2,3}` bytes, the descendant `/Subtype`, and a stream-form descendant `/CIDToGIDMap`, so same-named-but-different fonts get distinct keys regardless of subset-tag form or object reuse, while genuinely identical fonts still deduplicate across documents (the cache's purpose is preserved). Same bug class as the `/Widths` poisoning fixed in #598.
 - **Watermark annotations rendered as nothing in compliant viewers** — a watermark's `/AP` appearance was serialized as a stream nested *directly* inside the annotation dictionary (`/AP <> stream … endstream>>`). A PDF stream must be an indirect object (ISO 32000-1:2008 §7.3.8); the inline form is invalid, so spec-compliant readers (e.g. MuPDF/PyMuPDF) rejected the annotation with "invalid key in dict" and the watermark never appeared — even though the bytes were present in the file. A shared `hoist_appearance_streams` helper now lifts nested `/N`, `/D`, and `/R` appearance streams (including named-state sub-dictionaries) into freshly allocated indirect objects and replaces the slot with a reference, applied on both the `DocumentBuilder` writer and the existing-page `DocumentEditor::save_page` paths. Verified end-to-end with MuPDF: the watermark now parses and renders on both paths.
 - **Fixed Python type stubs leaking the pyo3 `Py<Self>` receiver as a positional parameter** — methods implemented in Rust with a by-value receiver (`fn page(slf_handle: Py<Self>, …)` — the idiom pyo3 uses to hand a method an owned handle to its own instance) were emitted by the rylai stub generator with that receiver re-exposed *alongside* the injected `self`.
diff --git a/src/document.rs b/src/document.rs
index 5ff6dac9..6a4fba16 100644
--- a/src/document.rs
+++ b/src/document.rs
@@ -14570,35 +14570,64 @@ impl PdfDocument {
         Some(h)
     }

-    /// Document-aware extension of `font_identity_hash_cheap` that resolves
-    /// `/DescendantFonts` references on Type0 fonts and folds the descendant
-    /// CIDFont's width metrics (`/DW`, `/DW2`, `/W`, `/W2`) into the hash.
-    ///
-    /// Without this, two Type0 fonts whose Type0 dicts have identical inline
-    /// shape (same BaseFont, Encoding, ToUnicode/DescendantFonts refs) but
-    /// whose referenced CIDFonts carry different vertical metrics collide on
-    /// the Layer 5/6 caches — the second document silently inherits the
-    /// first's `w1y` and renders vertical text at the wrong advance. This is
-    /// the same bug class as the ToUnicode-stream poisoning fixed in
-    /// `a327bcd` and the `/Widths` poisoning fixed in #598, applied to the
-    /// descendant CIDFont's horizontal AND vertical width arrays.
-    ///
-    /// Cost: one `load_object` per descendant CIDFont (typically one) on the
-    /// first call; subsequent calls hit `font_id_hash_cache`. The descendant
-    /// load is the same work `FontInfo::from_dict` will do later, so the
-    /// marginal cost when a font actually needs parsing is zero; the only
-    /// new work is on cache *hits* that previously skipped descendant
-    /// resolution entirely. In return we trade off one indirect-ref load per
-    /// unique Type0 font per process for correctness on /W2 + /DW2.
+    /// Document-aware extension of `font_identity_hash_cheap` that folds the
+    /// *content* of a font's document-specific streams — its `/ToUnicode` CMap
+    /// and embedded font program(s) — plus the descendant CIDFont's width
+    /// metrics (`/DW`, `/DW2`, `/W`, `/W2`) and stream-form `/CIDToGIDMap` into
+    /// the identity hash.
+    ///
+    /// Why content, not just references: `font_identity_hash_cheap` folds only
+    /// the *reference* (object id/gen) of `/ToUnicode`, and the global cache is
+    /// skipped only for *canonical* subset fonts (`AAAAAA+`, six uppercase
+    /// letters + `+`; see `is_subset_basefont`). A non-canonical subset tag
+    /// such as `/CIDFont+F1` is therefore still shared cross-document, and
+    /// PDFs emitted from a common template reuse the same `/ToUnicode` object
+    /// number — so two genuinely different fonts that merely share a
+    /// `/BaseFont` name produce an identical cheap hash. Keyed only by that
+    /// hash, the cross-document global cache (Layer 6) served a later document
+    /// the *earlier* font's parsed `FontInfo`, and its glyph→Unicode mapping
+    /// came out as a constant-offset cipher or control/PUA junk (e.g.
+    /// `SUMMARY` → `6800$5<`). Folding the `/ToUnicode` stream bytes — and the
+    /// embedded `/FontFile{,2,3}` bytes — gives such fonts distinct keys so
+    /// they can never collide regardless of subset-tag form or object reuse,
+    /// while genuinely identical fonts still dedup. This completes the
+    /// cross-document hardening from #595/#597/#598 (which folded the
+    /// `/ToUnicode` *reference* and the `/Widths`, and excluded canonical
+    /// `AAAAAA+` subsets), applied to the field that actually decodes text.
+    ///
+    /// Cost: a few extra `load_object` calls (the `/ToUnicode` stream, each
+    /// descendant CIDFont, the `/FontDescriptor`s and their font programs) on
+    /// the first encounter of a font per document; subsequent calls hit
+    /// `font_id_hash_cache`, and the loads themselves are served from the
+    /// object cache that `FontInfo::from_dict` populates anyway. Stream bytes
+    /// are folded *raw* (still encoded) — see `fold_stream_bytes`.
     fn font_identity_hash_with_descendants(&self, font_obj: &Object) -> u64 {
         use std::hash::{Hash, Hasher};
         // Seed with the cheap inline hash so existing identity coverage is
-        // preserved bit-for-bit when there are no descendants to fold in.
+        // preserved bit-for-bit when there are no streams/descendants to fold.
         let base = Self::font_identity_hash_cheap(font_obj);
         let mut hasher = std::collections::hash_map::DefaultHasher::new();
         base.hash(&mut hasher);

         if let Some(d) = font_obj.as_dict() {
+            // /ToUnicode stream BYTES — the decisive discriminator. The cheap
+            // hash folds only this stream's reference; folding its content is
+            // what stops same-named, differently-mapped fonts from colliding
+            // across documents when the cheap key matches (#595).
+            if let Some(to_unicode) = d.get("ToUnicode") {
+                17u8.hash(&mut hasher);
+                self.fold_stream_bytes(to_unicode, &mut hasher);
+            }
+
+            // Simple fonts (Type1/TrueType) carry their embedded program on the
+            // top-level /FontDescriptor. Two subset fonts that share a
+            // /BaseFont name but embed different glyph programs must not alias.
+            if let Some(fd) = d.get("FontDescriptor") {
+                if let Some(fd_obj) = self.resolve_indirect_for_hash(fd) {
+                    self.fold_font_program(&fd_obj, 18, &mut hasher);
+                }
+            }
+
             if let Some(Object::Array(arr)) = d.get("DescendantFonts") {
                 // Domain separator for the descendant section.
                 11u8.hash(&mut hasher);
@@ -14646,6 +14675,35 @@ impl PdfDocument {
                             16u8.hash(&mut hasher);
                             Self::hash_pdf_object_deterministic(csi, &mut hasher);
                         }
+                        // Descendant /Subtype: CIDFontType0 (CFF) and CIDFontType2
+                        // (TrueType) are not interchangeable even with identical
+                        // name + metrics; the top-level Subtype is `Type0` for both.
+                        if let Some(st) = dd.get("Subtype") {
+                            19u8.hash(&mut hasher);
+                            Self::hash_pdf_object_deterministic(st, &mut hasher);
+                        }
+                        // Embedded CIDFont program lives on the descendant's
+                        // /FontDescriptor (/FontFile2 for TrueType, /FontFile3 for
+                        // CFF). Folded under a distinct section so it cannot alias
+                        // a simple font's top-level program.
+                        if let Some(fd) = dd.get("FontDescriptor") {
+                            if let Some(fd_obj) = self.resolve_indirect_for_hash(fd) {
+                                self.fold_font_program(&fd_obj, 20, &mut hasher);
+                            }
+                        }
+                        // Descendant /CIDToGIDMap: the *stream* form remaps
+                        // CID→glyph (§9.7.4.3), so two otherwise-identical embedded
+                        // CIDFontType2 fonts with different maps select different
+                        // glyphs and must not alias. The `/Identity` name — and an
+                        // absent entry, which defaults to Identity — fold nothing,
+                        // so the common path's key is unchanged (and an explicit
+                        // `/Identity` still dedups with an absent one).
+                        if let Some(c2g) = dd.get("CIDToGIDMap") {
+                            if !matches!(c2g, Object::Name(_)) {
+                                21u8.hash(&mut hasher);
+                                self.fold_stream_bytes(c2g, &mut hasher);
+                            }
+                        }
                     }
                 }
             }
         }
@@ -14653,6 +14711,74 @@ impl PdfDocument {
         hasher.finish()
     }

+    /// Resolve a single level of indirection for hashing: returns the
+    /// referenced object, the object itself when already inline, or `None`
+    /// when a reference cannot be loaded (cycle/missing). Used only to reach a
+    /// `/FontDescriptor` dict — it never re-enters the font dict, so it cannot
+    /// loop.
+    fn resolve_indirect_for_hash(&self, obj: &Object) -> Option<Object> {
+        match obj {
+            Object::Reference(r) => self.load_object(*r).ok(),
+            other => Some(other.clone()),
+        }
+    }
+
+    /// Fold the *raw* bytes of a (possibly indirectly-referenced) stream into
+    /// the hash. Folds nothing when the object is absent, unreadable, or not a
+    /// stream.
+    ///
+    /// Raw — still-encoded — bytes are deliberate. They are a sufficient
+    /// discriminator: different decoded content yields different encoded bytes
+    /// under any deterministic filter, so this never produces a *false* dedup
+    /// (two different fonts sharing a key). It avoids inflating large font
+    /// programs on the cache-key path. The only cost is a *missed* dedup when
+    /// the same logical content is stored under two different filters
+    /// (e.g. raw vs. FlateDecode) — harmless, and not a pattern a single
+    /// producer emits within a corpus.
+    fn fold_stream_bytes<H: std::hash::Hasher>(&self, obj: &Object, hasher: &mut H) {
+        use std::hash::Hash;
+        let owned;
+        let stream: &Object = match obj {
+            Object::Stream { .. } => obj,
+            Object::Reference(r) => match self.load_object(*r) {
+                Ok(o) => {
+                    owned = o;
+                    &owned
+                },
+                Err(_) => return,
+            },
+            _ => return,
+        };
+        if let Object::Stream { data, .. } = stream {
+            (data.len() as u64).hash(hasher);
+            data.as_ref().hash(hasher);
+        }
+    }
+
+    /// Fold any embedded font program (`/FontFile`, `/FontFile2`,
+    /// `/FontFile3`) reachable from a `/FontDescriptor` dict into the hash,
+    /// namespaced by `section` so a simple font's program and a descendant
+    /// CIDFont's program cannot alias each other.
+    fn fold_font_program<H: std::hash::Hasher>(
+        &self,
+        descriptor: &Object,
+        section: u8,
+        hasher: &mut H,
+    ) {
+        use std::hash::Hash;
+        let dict = match descriptor.as_dict() {
+            Some(d) => d,
+            None => return,
+        };
+        for (variant, key) in ["FontFile", "FontFile2", "FontFile3"].iter().enumerate() {
+            if let Some(ff) = dict.get(*key) {
+                section.hash(hasher);
+                (variant as u8).hash(hasher);
+                self.fold_stream_bytes(ff, hasher);
+            }
+        }
+    }
+
     /// Hash a PDF `Object` deterministically. Used by the descendant-aware
     /// font identity hash to fold raw width-array content into the key.
     ///
diff --git a/tests/test_font_cache_cross_document.rs b/tests/test_font_cache_cross_document.rs
new file mode 100644
index 00000000..a72caf31
--- /dev/null
+++ b/tests/test_font_cache_cross_document.rs
@@ -0,0 +1,250 @@
+//! Cross-document font-cache collision regression (completes #595, #597, #598).
+//!
+//! The process-global font cache (`fonts::global_cache`) is keyed by a font
+//! *identity hash*. The #595 hardening folds the `/ToUnicode` *reference*
+//! (object id/gen) into that hash and keeps *canonical* subset fonts
+//! (`AAAAAA+`) out of the cache. A non-canonical subset tag such as
+//! `/CIDFont+F1` falls outside that exclusion, and template-emitted PDFs reuse
+//! the same `/ToUnicode` object number, so the reference-keyed hash can still
+//! match for two genuinely different fonts — the later document is then served
+//! the earlier font's parsed `FontInfo`, and its glyphs decode through the
+//! wrong `/ToUnicode` and come out garbled. Folding the stream's bytes (not
+//! just its reference) distinguishes them and closes this case.
+//!
+//! Both PDFs here are built in memory (per the repo's no-binary-fixtures
+//! convention) and are byte-for-byte identical except for the CID→Unicode
+//! mapping: same `/BaseFont` (`/CIDFont+F1`, the non-canonical subset tag some
+//! real generators emit), same object numbers, same width metrics — only the
+//! `/ToUnicode` stream and the matching content-stream CIDs differ. That is the
+//! exact shape that triggered the leak.
+//!
+//! Oracle: correct text contains the header `SUMMARY`; a font decoded through
+//! another document's `/ToUnicode` does not.
+
+use pdf_oxide::document::PdfDocument;
+use pdf_oxide::fonts::global_cache::{clear_global_font_cache, global_font_cache_stats};
+use std::sync::Mutex;
+
+/// Serializes the two tests in this binary: both assert against the
+/// process-global cache, so they must not run concurrently.
+static CACHE_LOCK: Mutex<()> = Mutex::new(());
+
+/// Lines rendered on the single page. The content is fabricated and trivial;
+/// only the presence of `SUMMARY` matters to the oracle.
+const LINES: &[&str] = &[
+    "SUMMARY",
+    "Synthetic document for the font-cache regression.",
+    "Text is recoverable only via the ToUnicode CMap.",
+];
+
+/// Build a minimal non-embedded Type0/Identity-H PDF in memory.
+///
+/// Every document shares one object layout and `/BaseFont` name, so their cheap
+/// identity hashes collide. `cid_base` shifts the (otherwise sequential) glyph
+/// indices, mirroring a real subset font whose CIDs are arbitrary indices
+/// unrelated to Unicode and recoverable only through `/ToUnicode`. Two
+/// documents built with different `cid_base` therefore carry byte-different
+/// `/ToUnicode` streams and content-stream CIDs while remaining identical in
+/// every field the pre-fix key looked at.
+fn build_type0_pdf(cid_base: u16, cid_to_gid: Option<&[u8]>) -> Vec<u8> {
+    // Distinct characters in first-appearance order; CID = cid_base + index.
+    let mut chars: Vec<char> = Vec::new();
+    for ch in LINES.iter().flat_map(|l| l.chars()) {
+        if !chars.contains(&ch) {
+            chars.push(ch);
+        }
+    }
+    let cid = |ch: char| -> u16 {
+        let idx = chars.iter().position(|&c| c == ch).unwrap() as u16;
+        cid_base + idx
+    };
+
+    // Content stream: 2-byte CIDs, one `Tj` per line.
+    let mut content = String::from("BT\n/F1 13 Tf\n15 TL\n40 770 Td\n");
+    for line in LINES {
+        let hex: String = line.chars().map(|ch| format!("{:04X}", cid(ch))).collect();
+        content.push_str(&format!("<{hex}> Tj\nT*\n"));
+    }
+    content.push_str("ET");
+
+    // ToUnicode CMap inverting the CID→Unicode mapping.
+    let bfchar: String = chars
+        .iter()
+        .map(|&ch| format!("<{:04X}> <{:04X}>", cid(ch), ch as u32))
+        .collect::<Vec<_>>()
+        .join("\n");
+    let cmap = format!(
+        "/CIDInit /ProcSet findresource begin\n12 dict begin\nbegincmap\n\
+         /CIDSystemInfo <> def\n\
+         /CMapName /Adobe-Identity-UCS def\n/CMapType 2 def\n\
+         1 begincodespacerange\n<0000> <FFFF>\nendcodespacerange\n\
+         {} beginbfchar\n{}\nendbfchar\n\
+         endcmap\nCMapName currentdict /CMap defineresource pop\nend\nend",
+        chars.len(),
+        bfchar
+    );
+
+    // /CIDToGIDMap defaults to the `/Identity` name; `cid_to_gid` switches it to
+    // the stream form (object 9) so a test can vary its bytes.
+    let cid_to_gid_entry = if cid_to_gid.is_some() {
+        "/CIDToGIDMap 9 0 R"
+    } else {
+        "/CIDToGIDMap /Identity"
+    };
+
+    let mut objs: Vec<Vec<u8>> = vec![
+        b"<< /Type /Catalog /Pages 2 0 R >>".to_vec(),
+        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>".to_vec(),
+        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] \
+          /Resources << /Font << /F1 5 0 R >> >> /Contents 4 0 R >>"
+            .to_vec(),
+        format!("<< /Length {} >>\nstream\n{content}\nendstream", content.len()).into_bytes(),
+        b"<< /Type /Font /Subtype /Type0 /BaseFont /CIDFont+F1 /Encoding /Identity-H \
+          /DescendantFonts [6 0 R] /ToUnicode 8 0 R >>"
+            .to_vec(),
+        format!(
+            "<< /Type /Font /Subtype /CIDFontType2 /BaseFont /CIDFont+F1 \
+             /CIDSystemInfo << /Registry (Adobe) /Ordering (Identity) /Supplement 0 >> \
+             /FontDescriptor 7 0 R /DW 500 {cid_to_gid_entry} >>"
+        )
+        .into_bytes(),
+        // No /FontFile* — non-embedded, like the real garbled documents.
+        b"<< /Type /FontDescriptor /FontName /CIDFont+F1 /Flags 4 \
+          /FontBBox [0 -200 1000 900] /ItalicAngle 0 /Ascent 800 /Descent -200 \
+          /CapHeight 700 /StemV 80 /MissingWidth 500 >>"
+            .to_vec(),
+        format!("<< /Length {} >>\nstream\n{cmap}\nendstream", cmap.len()).into_bytes(),
+    ];
+    if let Some(map) = cid_to_gid {
+        let mut obj = format!("<< /Length {} >>\nstream\n", map.len()).into_bytes();
+        obj.extend_from_slice(map);
+        obj.extend_from_slice(b"\nendstream");
+        objs.push(obj);
+    }
+
+    // Assemble with a byte-accurate xref table.
+    let mut out: Vec<u8> = b"%PDF-1.7\n".to_vec();
+    let mut offsets = Vec::with_capacity(objs.len());
+    for (i, body) in objs.iter().enumerate() {
+        offsets.push(out.len());
+        out.extend_from_slice(format!("{} 0 obj\n", i + 1).as_bytes());
+        out.extend_from_slice(body);
+        out.extend_from_slice(b"\nendobj\n");
+    }
+    let xref_off = out.len();
+    let size = objs.len() + 1;
+    out.extend_from_slice(format!("xref\n0 {size}\n0000000000 65535 f \n").as_bytes());
+    for off in &offsets {
+        out.extend_from_slice(format!("{off:010} 00000 n \n").as_bytes());
+    }
+    out.extend_from_slice(
+        format!("trailer\n<< /Size {size} /Root 1 0 R >>\nstartxref\n{xref_off}\n%%EOF").as_bytes(),
+    );
+    out
+}
+
+fn extract_first_page(bytes: Vec<u8>) -> String {
+    let doc = PdfDocument::from_bytes(bytes).expect("parse synthetic PDF");
+    doc.extract_text(0).expect("extract page 0")
+}
+
+/// Several documents that share a `/BaseFont` name but map glyphs differently
+/// must each decode through their own `/ToUnicode`, even when processed
+/// back-to-back in one process without clearing the cache between them.
+#[test]
+fn distinct_tounicode_fonts_do_not_collide_across_documents() {
+    let _guard = CACHE_LOCK.lock().unwrap_or_else(|e| e.into_inner());
+    clear_global_font_cache();
+
+    // Distinct CID bases ⇒ distinct ToUnicode streams. The first document
+    // primes the global cache; before the fix, every later one inherited its
+    // mapping. The bases are arbitrary, only mutually distinct.
+    let bases = [3u16, 1000, 2000, 40000];
+    let mut garbled = Vec::new();
+    for base in bases {
+        let text = extract_first_page(build_type0_pdf(base, None));
+        if !text.contains("SUMMARY") {
+            let preview: String = text.chars().take(48).collect();
+            garbled.push(format!("cid_base={base}: {preview:?}"));
+        }
+    }
+
+    assert!(
+        garbled.is_empty(),
+        "{} of {} same-named fonts decoded through another document's ToUnicode \
+         (cross-document font-cache collision):\n {}",
+        garbled.len(),
+        bases.len(),
+        garbled.join("\n ")
+    );
+}
+
+/// The precise key must not regress the dedup the global cache exists for:
+/// a byte-identical font (different document) is a cache *hit* with no new
+/// entry, while a font with a different `/ToUnicode` gets its own entry.
+#[test]
+fn identical_fonts_dedup_while_distinct_fonts_get_separate_entries() {
+    let _guard = CACHE_LOCK.lock().unwrap_or_else(|e| e.into_inner());
+    clear_global_font_cache();
+    assert_eq!(global_font_cache_stats().0, 0, "cache should be empty after clear");
+
+    // Each document defines exactly one cross-document-shareable Type0 font, so
+    // the cache grows by one entry per *distinct* font.
+    assert!(extract_first_page(build_type0_pdf(3, None)).contains("SUMMARY"));
+    let after_first = global_font_cache_stats().0;
+    assert_eq!(after_first, 1, "first document inserts exactly one font");
+
+    // Same bytes, brand-new PdfDocument: must hit the global cache, not reinsert.
+    assert!(extract_first_page(build_type0_pdf(3, None)).contains("SUMMARY"));
+    assert_eq!(
+        global_font_cache_stats().0,
+        after_first,
+        "an identical font must hit the global cache rather than re-insert"
+    );
+
+    // Different ToUnicode: must get its own entry (the absence of which was the
+    // collision bug) and decode correctly.
+    assert!(extract_first_page(build_type0_pdf(2000, None)).contains("SUMMARY"));
+    assert_eq!(
+        global_font_cache_stats().0,
+        after_first + 1,
+        "a font with a different ToUnicode must not alias the cached one"
+    );
+}
+
+/// A *stream*-form `/CIDToGIDMap` remaps CID→glyph (ISO 32000-1 §9.7.4.3), so
+/// two embedded CIDFontType2 fonts identical in name, `/ToUnicode`, and metrics
+/// but differing in that stream are not interchangeable and must get separate
+/// cache entries. (PR #733 review: the `/Identity` name, the default, still
+/// folds nothing — that case is covered by the tests above.)
+#[test]
+fn stream_cid_to_gid_map_distinguishes_otherwise_identical_fonts() {
+    let _guard = CACHE_LOCK.lock().unwrap_or_else(|e| e.into_inner());
+    clear_global_font_cache();
+
+    // Same cid_base ⇒ identical /ToUnicode and content; the ONLY difference is
+    // the /CIDToGIDMap stream. Sized to cover every CID used (2 bytes/CID), GIDs
+    // kept < 0x80 so it is well-formed map data.
+    let distinct = LINES
+        .iter()
+        .flat_map(|l| l.chars())
+        .fold(Vec::new(), |mut v, c| {
+            if !v.contains(&c) {
+                v.push(c);
+            }
+            v
+        });
+    let len = 2 * (0x21 + distinct.len());
+    let map_a: Vec<u8> = (0..len).map(|i| (i % 0x40) as u8).collect();
+    let mut map_b = map_a.clone();
+    *map_b.last_mut().unwrap() ^= 0x01; // differ by a single byte
+
+    assert!(extract_first_page(build_type0_pdf(0x21, Some(&map_a))).contains("SUMMARY"));
+    let after_a = global_font_cache_stats().0;
+    assert!(extract_first_page(build_type0_pdf(0x21, Some(&map_b))).contains("SUMMARY"));
+    assert_eq!(
+        global_font_cache_stats().0,
+        after_a + 1,
+        "fonts differing only in a stream /CIDToGIDMap must not alias in the cache"
+    );
+}
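
Reviewer note: the keying change in this patch can be reduced to a toy model. The sketch below is not PDFOxide code. `ToyFont`, `cheap_key`, and `content_key` are hypothetical names, and the only thing it assumes from the patch is the shape of the key: a `DefaultHasher` digest seeded with the reference-based hash, followed by a one-byte section tag, the stream length, and the raw stream bytes. It illustrates why two fonts that share a `/BaseFont` name and a `/ToUnicode` object number collide on the reference-only key but not on the content key.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for a font dictionary (not a PDFOxide type).
struct ToyFont<'a> {
    base_font: &'a str,
    /// (object id, generation) of the `/ToUnicode` stream.
    to_unicode_ref: (u32, u16),
    /// Raw, still-encoded bytes of the `/ToUnicode` stream.
    to_unicode_bytes: &'a [u8],
}

/// Reference-only key: sees the name and the object number, never the content.
fn cheap_key(font: &ToyFont) -> u64 {
    let mut hasher = DefaultHasher::new();
    font.base_font.hash(&mut hasher);
    font.to_unicode_ref.hash(&mut hasher);
    hasher.finish()
}

/// Content key: seeded with the cheap key, then a section tag, the stream
/// length, and the raw stream bytes.
fn content_key(font: &ToyFont) -> u64 {
    let mut hasher = DefaultHasher::new();
    cheap_key(font).hash(&mut hasher);
    17u8.hash(&mut hasher); // section tag, mirroring the patch's /ToUnicode tag
    (font.to_unicode_bytes.len() as u64).hash(&mut hasher);
    font.to_unicode_bytes.hash(&mut hasher);
    hasher.finish()
}
```

With two `ToyFont` values that differ only in `to_unicode_bytes`, `cheap_key` returns the same value for both while `content_key` separates them, and two values with identical bytes still share a `content_key`, which is the dedup property the second test in the patch checks against the real cache.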