Replicate the textual content of a page #2778

syntax-surgeon · 2023-11-01T15:22:52Z

syntax-surgeon
Nov 1, 2023

Hi all,

The problem I am tackling is as follows: I have a single-page PDF where white boxes (rectangle drawings) sit on top of the text. I want to remove them so that I can see the underlying text.

To check whether the text is "actually" present beneath the white box, I used the Inkscape editor to manually move the white box. The text was indeed present. However, I want to automate this process so that I don't have to use a GUI editor for hundreds of such PDFs.

I have read and re-read the PyMuPDF documentation but cannot find a solution. A very similar issue was previously discussed in #847. Since I am not comfortable directly manipulating the content stream, I would appreciate any other method.

A possible alternative solution (at least for the PDFs I am working with) is to simply recreate the textual content of the page on a new page without recreating the drawings.

Two methods of interest are Page.insert_text() and Page.insert_textbox(). I have tried to replicate the text using the dictionary returned by Page.get_text("dict"), but in many cases the result is not identical, perhaps due to mismatches in font size, font type, etc.
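
One way to see where such mismatches might come from is to tally which fonts and sizes the spans actually report. The following is only a sketch: the helper name font_size_usage is made up for illustration, and it assumes the usual blocks / lines / spans layout of the "dict" output.

```python
from collections import Counter


def font_size_usage(text_dict):
    """Count characters per (font name, font size) pair in a "dict" extraction.

    `text_dict` is assumed to be the result of Page.get_text("dict").
    """
    usage = Counter()
    for block in text_dict["blocks"]:
        # Image blocks carry no "lines" key, so default to an empty list.
        for line in block.get("lines", []):
            for span in line["spans"]:
                key = (span["font"], round(span["size"], 2))
                usage[key] += len(span["text"])
    return usage
```

Calling font_size_usage(page.get_text("dict")) on the source page and again on the recreated page should show which font names or sizes differ between the two.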

Is there any other way to replicate the same?

Answered by JorjMcKie

Nov 1, 2023

JorjMcKie · 2023-11-01T15:34:32Z

JorjMcKie
Nov 1, 2023
Maintainer

An important aspect would be what type of objects those white boxes are: in fact vector graphics? Or maybe annotations?

2 replies

JorjMcKie Nov 1, 2023
Maintainer

If some software can "move them around", then it does not sound as if a /Contents modification is incurred each time.

syntax-surgeon Nov 1, 2023
Author

@JorjMcKie These white-boxes are not stored as annotations since list(Page.annots()) is empty.

I am attaching a dummy PDF below which can replicate the problem.
The result of Page.get_drawings() is as follows:

[{'items': [('re',
    Rect(312.0, 255.407958984375, 442.2236328125, 305.91998291015625),
    1)],
  'type': 'fs',
  'even_odd': False,
  'fill_opacity': 1.0,
  'fill': (1.0, 1.0, 1.0),
  'rect': Rect(312.0, 255.407958984375, 442.2236328125, 305.91998291015625),
  'seqno': 60,
  'layer': '',
  'stroke_opacity': 1.0,
  'color': (1.0, 1.0, 1.0),
  'width': 0.6712599992752075,
  'lineCap': (0, 0, 0),
  'lineJoin': 0.0,
  'closePath': False,
  'dashes': '[] 0'}]
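
For reference, a small filter over output like the above could pick out the white boxes. This is a sketch under assumptions: the function name white_fill_rects and the tolerance are invented for illustration, and it relies only on the 'fill', 'fill_opacity' and 'rect' keys visible in the listing.

```python
def white_fill_rects(drawings, tol=1e-3):
    """Return the 'rect' of every drawing that is filled with opaque (near) white.

    `drawings` is assumed to be the list returned by Page.get_drawings().
    """
    hits = []
    for drawing in drawings:
        fill = drawing.get("fill")
        if fill is None:
            continue  # stroke-only paths have no fill color
        is_white = all(abs(component - 1.0) <= tol for component in fill)
        opacity = drawing.get("fill_opacity")
        is_opaque = opacity is None or opacity >= 1.0 - tol
        if is_white and is_opaque:
            hits.append(drawing["rect"])
    return hits
```

white_fill_rects(page.get_drawings()) would then list the rectangles that may be hiding text; it only reports them and does not remove anything from the page.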

big_lorem_whitebox.pdf

JorjMcKie · 2023-11-01T20:06:39Z

JorjMcKie
Nov 1, 2023
Maintainer

Is there any other way to replicate the same?

There is if you can get hold of the original fonts. This can be cumbersome, because in the general case you must:

1. Set a global parameter that causes the full font names to be contained in the "dict" output, i.e. including the "ABCDEF+" prefixes of subsetted fonts, if any: fitz.TOOLS.set_subset_fontnames(True).
2. Find the xref of that font in page.get_fonts().
3. Extract the font via doc.extract_font(xref).
4. Insert the font in the new page via page.insert_font(fontname="unique-string", fontbuffer=buffer), with the buffer returned by font extraction.
5. Insert text using page.insert_text(point, text, fontname="unique-string"). So the fontname must be different for every font you have identified in that way.
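
The steps above can be sketched roughly as follows. This is an illustration, not a tested recipe: the helper names, the "xf{xref}" naming scheme and the "helv" fallback are assumptions, the page is assumed to be unrotated, and span font names are assumed to match the basefont entries of page.get_fonts() once fitz.TOOLS.set_subset_fontnames(True) has been called.

```python
def srgb_to_rgb(srgb):
    """Convert the integer span["color"] into an (r, g, b) tuple of floats in 0..1."""
    return ((srgb >> 16) & 255) / 255, ((srgb >> 8) & 255) / 255, (srgb & 255) / 255


def font_xrefs_by_basefont(font_list):
    """Map basefont -> xref from page.get_fonts() tuples
    (xref, ext, type, basefont, name, encoding)."""
    return {item[3]: item[0] for item in font_list}


def copy_text_spans(src_doc, page, newpage, fallback="helv"):
    """Re-insert every text span of `page` on `newpage`, reusing embedded fonts."""
    xrefs = font_xrefs_by_basefont(page.get_fonts())
    names = {}  # xref -> font reference name to use on newpage
    for block in page.get_text("dict")["blocks"]:
        for line in block.get("lines", []):  # image blocks have no "lines"
            for span in line["spans"]:
                fontname = fallback
                xref = xrefs.get(span["font"])
                if xref is not None:
                    if xref not in names:
                        buffer = src_doc.extract_font(xref)[3]
                        if buffer:
                            # Hypothetical naming scheme: one unique name per font.
                            names[xref] = f"xf{xref}"
                            newpage.insert_font(fontname=names[xref], fontbuffer=buffer)
                        else:
                            # No font data could be extracted; use the fallback.
                            names[xref] = fallback
                    fontname = names[xref]
                newpage.insert_text(
                    span["origin"],
                    span["text"],
                    fontname=fontname,
                    fontsize=span["size"],
                    color=srgb_to_rgb(span["color"]),
                )


# Usage (requires PyMuPDF; file names are placeholders):
#   import fitz
#   fitz.TOOLS.set_subset_fontnames(True)
#   src = fitz.open("input.pdf")
#   page = src[0]
#   out = fitz.open()
#   newpage = out.new_page(width=page.rect.width, height=page.rect.height)
#   copy_text_spans(src, page, newpage)
#   out.save("text-only.pdf")
```

Because only text spans are copied, drawings such as filled rectangles are simply left behind on the source page.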

1 reply

syntax-surgeon Nov 2, 2023
Author

That works way better than anything I could come up with. Thank you @JorjMcKie

syntax-surgeon · 2023-11-02T07:10:44Z

syntax-surgeon
Nov 2, 2023
Author

@JorjMcKie Just a follow-up: several fonts are not being displayed when I write them to the new page using page.insert_text().
The source page produces the following output for page.get_fonts(): [(13, 'ttf', 'Type0', 'AAAZJC+Carlito', 'F1', 'Identity-H')]

Then I extract the font using myfont = doc.extract_font(13) and insert this font into the new page as follows: newpage.insert_font(fontname=myfont[0], fontbuffer=myfont[3])

I then proceed to insert text using page.insert_text(), but the output is not what I expected. The text is not displayed properly; in fact, it only shows empty boxes (see image below):

When I inspect the fonts in the newpage via newpage.get_fonts() I get the following output: [(5, 'ttf', 'Type0', '(null)', 'AAAZJC+Carlito', 'Identity-H')].

I am aware that the page in question is rotated and I am correcting for that using point = point*page.rotation_matrix before inserting the text.
How can this be resolved?

3 replies

JorjMcKie Nov 2, 2023
Maintainer

Yes, that may happen sometimes. In this case, you have no choice but to locate the original full font file for "Carlito" and use it instead. Or, of course, just fall back to another font you deem sufficiently similar.
Before saving the new document, you may do doc.subset_fonts() to create subset fonts for any such fonts.

syntax-surgeon Nov 2, 2023
Author

What did you mean by "locating the original fontfile"? Did you mean downloading it from an external source and then using it?
Also, I am not sure what the value '(null)' means as a basefont.

JorjMcKie Nov 2, 2023
Maintainer

Yes - this special font is a mimicry of MS Calibri and is provided by Google Fonts. You would have to download it from there. Or, if you otherwise have access to Calibri, you can use that instead, because it is metrically compatible.
The subset font of Carlito in this case has been built in a way that makes it non-reusable. One indicator is that the font name is no longer provided ("null"). This is unusual.
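
A fallback along those lines might look like the sketch below. Everything named here is an assumption for illustration: the FONT_FILES table, the file name "Carlito-Regular.ttf" (a downloaded copy of the full font), the reference name "carlito", and the "helv" default. The `page` argument is expected to be a PyMuPDF Page.

```python
import re

# Subset fonts carry a six-uppercase-letter tag, e.g. "AAAZJC+Carlito".
_SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")

# Hypothetical table: base font name -> (reference name, path to the full font file).
FONT_FILES = {"Carlito": ("carlito", "Carlito-Regular.ttf")}


def strip_subset_prefix(fontname):
    """Return the font name without a leading "ABCDEF+" subset tag, if present."""
    return _SUBSET_PREFIX.sub("", fontname)


def insert_with_full_font(page, point, text, span_font, fontsize=11, fallback="helv"):
    """Insert text using a full font file when one is known for `span_font`."""
    entry = FONT_FILES.get(strip_subset_prefix(span_font))
    if entry is None:
        # No replacement file known: fall back to a built-in font.
        return page.insert_text(point, text, fontname=fallback, fontsize=fontsize)
    refname, fontfile = entry
    page.insert_font(fontname=refname, fontfile=fontfile)
    return page.insert_text(point, text, fontname=refname, fontsize=fontsize)
```

As mentioned above, calling doc.subset_fonts() on the new document before saving should keep the file from carrying the complete font.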
