fix: cache poisoning when fonts collide on basename with different To… by RayVR · Pull Request #596 · yfedoseev/pdf_oxide

RayVR · 2026-05-27T10:15:14Z

Description

resolves issue #595

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Performance improvement
Code refactoring
Tests
CI/CD changes

Related Issues

Fixes #595

Changes Made

font_identity_hash_cheap — Return type changed from u64 to (u64, bool). The bool indicates whether the font has a subset prefix (AAAAAA+). Subset fonts have document-specific ToUnicode mappings that are unsafe to share across documents.
load_fonts — The global cross-document font cache (Layer 6) is now skipped for subset fonts, both for lookup and insertion. The per-document cache (Layer 5) still works for subset fonts.
Tests — Updated 4 existing tests to destructure the new return type. Added 1 new test test_font_identity_hash_detects_subset_prefix that verifies AAAAAA+ArialUnicodeMS is detected as subset and ArialUnicodeMS is not.

Root cause: Two different PDFs with the same subset-prefixed font name (e.g., AAAAAA+ArialUnicodeMS) but different glyph subsets would share a cached FontInfo across documents. The first document's ToUnicode CMap would be used for the second document, mapping characters to wrong codepoints (e.g., ° → 日).

Testing

I have added tests that prove my fix is effective or that my feature works
All new and existing tests pass locally
I have run cargo test --all-features
I have run cargo clippy -- -D warnings
I have run cargo fmt

Python Bindings (if applicable)

Python bindings updated (if needed)
Python tests pass
Python code formatted with ruff format
Python code linted with ruff check

Documentation

I have updated the documentation (README, docs/, code comments)
I have added/updated examples (if applicable)
I have updated CHANGELOG.md

Checklist

My code follows the project's coding guidelines (see CONTRIBUTING.md)
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
My changes generate no new warnings
I have checked my code and corrected any misspellings
The PR title follows conventional commits format (e.g., feat:, fix:, docs:)

Screenshots (if applicable)

Additional Notes

…Unicode streams Signed-off-by: Raymond Roberts <mail@rayvroberts.com>

yfedoseev · 2026-05-27T20:21:11Z

Hi @RayVR — thanks for this, really clean fix. The root-cause analysis in #595 was spot-on, the subset-prefix detector matches PDF 32000-1 §9.6.4 exactly (docs/spec/pdf.md:17842 — "exactly six uppercase letters" + +), and the synthesized regression test in tests/test_subset_font_cache.rs is exactly the right shape — much better than shipping the issue's binary PDFs. The Encoding-ref tightening (hashing id/gen instead of the constant b"enc_ref") is a nice quiet improvement on top.

A few items before merge:

Please address

1. Legacy tests at src/document.rs:15839, :15853, :15866 weren't updated for the new tuple return. They compile and pass — (u64, bool) is PartialEq so assert_eq!(hash1, hash2) still works — but they read as "forgot to update." Mechanical destructure to let (hash1, _) = ... would match the style of your new tests at :17131+.

2. Silent semantic change at src/document.rs:10894 and :11054. Self::font_identity_hash_cheap(&font).hash(&mut h) now folds the is_subset bool into the combined font-set cache key (because (u64, bool) implements Hash). Behavior-wise it's fine — the flag is a deterministic property of the font — but a future reader will trip over it. Either destructure and hash only id_hash (preserves prior semantics), or leave as-is with a one-line comment ("subset flag intentionally included; deterministic per font").

3. is_subset is false when BaseFont is absent. Today the detector only sets the flag inside the Some(Object::Name(n)) = d.get("BaseFont") arm. If a font dict reaches load_fonts without a BaseFont (malformed/inline), is_subset stays false and the font goes into the global cache. Worth either:

asserting BaseFont presence is checked upstream in FontInfo::from_dict, so this branch is unreachable; or
defaulting is_subset = true (fail-safe — exclude from global cache) when BaseFont is missing, since we can't prove safety without it.

The cost of option (b) is at most one re-parse per such font.

Minors

Doc comment on font_identity_hash_cheap still says "For reference-only fields (ToUnicode, FontDescriptor, DescendantFonts), hashes their presence to avoid false positives." That "presence only" phrasing is now misleading since the ToUnicode arm hashes id/gen, and Encoding-ref does too. Tighten to reflect that indirect refs participate in the hash by ID.
Test file clear_global_font_cache() is process-global. Integration tests run in their own binary so cargo's per-binary isolation protects us today, but if anyone later adds a second #[test] here that relies on warm cache state, this will surprise them. A one-line comment at the top of the test ("relies on having the whole binary to itself") is enough.
CHANGELOG entry missing — please add a Fixed entry under the next-release section noting the cross-document cache poisoning for subset fonts (cite [Bug]: reliance on BaseFont name instead of the ToUnicode stream causes corrupted processing #595).

Out of scope for this PR — intentionally deferred

Per spec §9.6.5 (Type 3, docs/spec/pdf.md:17860) and @haberman's follow-up comment on #595, two more "false-hit" classes exist that this PR intentionally does not cover. I've filed them as separate issues so #595 can close cleanly once this lands:

Type 3 fonts must not be cached in the cross-document global font cache #597 — Type 3 fonts must not be cached in the cross-document global font cache (inherently document-local glyph procs; no subset prefix, so the new guard still admits them to Layer 6).
Cross-document font cache returns wrong /Widths when two PDFs share a BaseFont name with different metrics #598 — Cross-document font cache returns wrong /Widths when two PDFs share a BaseFont name with different metrics (affects standard-14 overrides and custom non-subset fonts).

Both are real and the same shape as your fix; please mention them in this PR's description under a "Follow-ups (not in this PR)" section so the scope boundary is explicit. After this lands I'd suggest #595 stays open until #597 and #598 are also resolved — or we close #595 and let #597/#598 carry the residual.

Thanks again — careful spec work and a real regression test, exactly the contribution shape I love to see.

…l-safe for missing BaseFont Fonts without a BaseFont key now default is_subset=true so they are never served from the global cross-document cache. The two combined- hash call sites now destructure the (hash, is_subset) tuple so only the hash participates in the page-level font-set fingerprint. Legacy tests updated to match the new return type. Doc comment clarified, CHANGELOG entry added.

fix: cache poisoning when fonts collide on basename with different To…

a327bcd

…Unicode streams Signed-off-by: Raymond Roberts <mail@rayvroberts.com>

RayVR requested a review from yfedoseev as a code owner May 27, 2026 10:15

This was referenced May 27, 2026

Type 3 fonts must not be cached in the cross-document global font cache #597

Open

Cross-document font cache returns wrong /Widths when two PDFs share a BaseFont name with different metrics #598

Open

yfedoseev mentioned this pull request May 27, 2026

CI infra: cargo-deny fails on yanked aes 0.9.0 and Composer security audit fails on abandoned PHPUnit packages — blocks all PRs #599

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: cache poisoning when fonts collide on basename with different To…#596

fix: cache poisoning when fonts collide on basename with different To…#596
RayVR wants to merge 2 commits into
yfedoseev:mainfrom
RayVR:fix/subset-font-cache-poisoning

RayVR commented May 27, 2026 •

edited

Loading

Uh oh!

yfedoseev commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

RayVR commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Type of Change

Related Issues

Changes Made

Testing

Python Bindings (if applicable)

Documentation

Checklist

Screenshots (if applicable)

Additional Notes

Uh oh!

yfedoseev commented May 27, 2026

Please address

Minors

Out of scope for this PR — intentionally deferred

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

RayVR commented May 27, 2026 •

edited

Loading