Skip to content

fix: cache poisoning when fonts collide on basename with different To…#596

Open
RayVR wants to merge 2 commits into
yfedoseev:mainfrom
RayVR:fix/subset-font-cache-poisoning
Open

fix: cache poisoning when fonts collide on basename with different To…#596
RayVR wants to merge 2 commits into
yfedoseev:mainfrom
RayVR:fix/subset-font-cache-poisoning

Conversation

@RayVR
Copy link
Copy Markdown

@RayVR RayVR commented May 27, 2026

Description

resolves issue #595

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Performance improvement
  • Code refactoring
  • Tests
  • CI/CD changes

Related Issues

Fixes #595

Changes Made

  1. font_identity_hash_cheap — Return type changed from u64 to (u64, bool). The bool indicates whether the font has a subset prefix (AAAAAA+). Subset fonts have document-specific ToUnicode mappings that are unsafe to share across documents.

  2. load_fonts — The global cross-document font cache (Layer 6) is now skipped for subset fonts, both for lookup and insertion. The per-document cache (Layer 5) still works for subset fonts.

  3. Tests — Updated 4 existing tests to destructure the new return type. Added 1 new test test_font_identity_hash_detects_subset_prefix that verifies AAAAAA+ArialUnicodeMS is detected as subset and ArialUnicodeMS is not.

Root cause: Two different PDFs with the same subset-prefixed font name (e.g., AAAAAA+ArialUnicodeMS) but different glyph subsets would share a cached FontInfo across documents. The first document's ToUnicode CMap would be used for the second document, mapping characters to wrong codepoints (e.g., ° → 日).

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • All new and existing tests pass locally
  • I have run cargo test --all-features
  • I have run cargo clippy -- -D warnings
  • I have run cargo fmt

Python Bindings (if applicable)

  • Python bindings updated (if needed)
  • Python tests pass
  • Python code formatted with ruff format
  • Python code linted with ruff check

Documentation

  • I have updated the documentation (README, docs/, code comments)
  • I have added/updated examples (if applicable)
  • I have updated CHANGELOG.md

Checklist

  • My code follows the project's coding guidelines (see CONTRIBUTING.md)
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • I have checked my code and corrected any misspellings
  • The PR title follows conventional commits format (e.g., feat:, fix:, docs:)

Screenshots (if applicable)

Additional Notes

…Unicode streams

Signed-off-by: Raymond Roberts <mail@rayvroberts.com>
@yfedoseev
Copy link
Copy Markdown
Owner

Hi @RayVR — thanks for this, really clean fix. The root-cause analysis in #595 was spot-on, the subset-prefix detector matches PDF 32000-1 §9.6.4 exactly (docs/spec/pdf.md:17842 — "exactly six uppercase letters" + +), and the synthesized regression test in tests/test_subset_font_cache.rs is exactly the right shape — much better than shipping the issue's binary PDFs. The Encoding-ref tightening (hashing id/gen instead of the constant b"enc_ref") is a nice quiet improvement on top.

A few items before merge:

Please address

1. Legacy tests at src/document.rs:15839, :15853, :15866 weren't updated for the new tuple return. They compile and pass — (u64, bool) is PartialEq so assert_eq!(hash1, hash2) still works — but they read as "forgot to update." Mechanical destructure to let (hash1, _) = ... would match the style of your new tests at :17131+.

2. Silent semantic change at src/document.rs:10894 and :11054. Self::font_identity_hash_cheap(&font).hash(&mut h) now folds the is_subset bool into the combined font-set cache key (because (u64, bool) implements Hash). Behavior-wise it's fine — the flag is a deterministic property of the font — but a future reader will trip over it. Either destructure and hash only id_hash (preserves prior semantics), or leave as-is with a one-line comment ("subset flag intentionally included; deterministic per font").

3. is_subset is false when BaseFont is absent. Today the detector only sets the flag inside the Some(Object::Name(n)) = d.get("BaseFont") arm. If a font dict reaches load_fonts without a BaseFont (malformed/inline), is_subset stays false and the font goes into the global cache. Worth either:

  • asserting BaseFont presence is checked upstream in FontInfo::from_dict, so this branch is unreachable; or
  • defaulting is_subset = true (fail-safe — exclude from global cache) when BaseFont is missing, since we can't prove safety without it.

The cost of option (b) is at most one re-parse per such font.

Minors

  • Doc comment on font_identity_hash_cheap still says "For reference-only fields (ToUnicode, FontDescriptor, DescendantFonts), hashes their presence to avoid false positives." That "presence only" phrasing is now misleading since the ToUnicode arm hashes id/gen, and Encoding-ref does too. Tighten to reflect that indirect refs participate in the hash by ID.
  • Test file clear_global_font_cache() is process-global. Integration tests run in their own binary so cargo's per-binary isolation protects us today, but if anyone later adds a second #[test] here that relies on warm cache state, this will surprise them. A one-line comment at the top of the test ("relies on having the whole binary to itself") is enough.
  • CHANGELOG entry missing — please add a Fixed entry under the next-release section noting the cross-document cache poisoning for subset fonts (cite [Bug]: reliance on BaseFont name instead of the ToUnicode stream causes corrupted processing #595).

Out of scope for this PR — intentionally deferred

Per spec §9.6.5 (Type 3, docs/spec/pdf.md:17860) and @haberman's follow-up comment on #595, two more "false-hit" classes exist that this PR intentionally does not cover. I've filed them as separate issues so #595 can close cleanly once this lands:

Both are real and the same shape as your fix; please mention them in this PR's description under a "Follow-ups (not in this PR)" section so the scope boundary is explicit. After this lands I'd suggest #595 stays open until #597 and #598 are also resolved — or we close #595 and let #597/#598 carry the residual.

Thanks again — careful spec work and a real regression test, exactly the contribution shape I love to see.

…l-safe for missing BaseFont

Fonts without a BaseFont key now default is_subset=true so they are
never served from the global cross-document cache.  The two combined-
hash call sites now destructure the (hash, is_subset) tuple so only
the hash participates in the page-level font-set fingerprint.  Legacy
tests updated to match the new return type.  Doc comment clarified,
CHANGELOG entry added.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: reliance on BaseFont name instead of the ToUnicode stream causes corrupted processing

2 participants