Replicate the textual content of a page #2778
-
Hi all,

The problem I am tackling is as follows: I have a single-page PDF in which white boxes (rectangle drawings) sit on top of the text. I want to remove them so that I can see the underlying text. To check whether the text is actually present beneath a white box, I used the Inkscape editor to move the box manually, and the text was indeed there. However, I want to automate this process so that I don't have to use a GUI editor for hundreds of such PDFs.

I have read and re-read the PyMuPDF documentation but can't seem to find a solution. A very similar issue was previously discussed in #847. Since I am not comfortable manipulating the content stream directly, I would appreciate any other method.

A possible alternative (at least for the PDFs I am working with) is to simply recreate the textual content of the page on a new page without recreating the drawings. Two methods of interest are Page.insert_text() and Page.insert_textbox(). I have tried to replicate the text using the dictionary returned by Page.get_text("dict"), but in many cases the result is not identical, perhaps because of mismatches in font size, font type, etc. Is there any other way to achieve this?
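For readers following along, the approach described above might look roughly like the sketch below. It is a guess at the general idea, not the poster's actual code: the helper names `srgb_to_rgb` and `copy_spans` are hypothetical, and every span is written with the built-in `helv` font as a stand-in, which is one reason the result will differ from the original whenever the source page uses other fonts.

```python
def srgb_to_rgb(value):
    """Convert the integer sRGB color of a text span to an (r, g, b) tuple in 0..1."""
    r = (value >> 16) & 0xFF
    g = (value >> 8) & 0xFF
    b = value & 0xFF
    return (r / 255, g / 255, b / 255)


def copy_spans(src_page, dst_page, fontname="helv"):
    """Re-insert every text span of src_page on dst_page, skipping drawings.

    Both pages are expected to be PyMuPDF Page objects.
    Returns the number of spans written.
    """
    count = 0
    for block in src_page.get_text("dict")["blocks"]:
        if block["type"] != 0:  # 0 = text block, 1 = image block
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                dst_page.insert_text(
                    span["origin"],       # start of the span's baseline
                    span["text"],
                    fontsize=span["size"],
                    fontname=fontname,    # assumption: built-in font as a stand-in
                    color=srgb_to_rgb(span["color"]),
                )
                count += 1
    return count
```

The target page would typically be created with the same dimensions as the source page, for example via `doc.new_page(width=..., height=...)`.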
Replies: 3 comments 6 replies
-
An important aspect is what type of objects those white boxes are: are they in fact vector graphics, or maybe annotations?
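One way to answer that question is to look at both the vector drawings and the annotations of the page. The sketch below assumes a PyMuPDF `Page` object and uses `Page.get_drawings()` and `Page.annots()`; the helper names `is_white` and `find_white_boxes` are hypothetical, and the test for "white" (a fill of RGB 1, 1, 1) is an assumption about how the boxes were drawn.

```python
def is_white(color, tol=1e-3):
    """True if color is an RGB tuple whose components are all (nearly) 1.0."""
    return (
        color is not None
        and len(color) == 3
        and all(abs(c - 1.0) <= tol for c in color)
    )


def find_white_boxes(page):
    """Collect candidates for the covering boxes on a PyMuPDF page.

    Returns a pair:
    - rectangles of filled vector paths whose fill color is white,
    - (type name, rectangle) for every annotation on the page.
    """
    vector_rects = [
        path["rect"]
        for path in page.get_drawings()
        if "f" in path["type"] and is_white(path.get("fill"))  # "f" or "fs" = filled
    ]
    annots = [(annot.type[1], annot.rect) for annot in page.annots()]
    return vector_rects, annots
```

If the first list is non-empty and its rectangles overlap the hidden text, the boxes are most likely vector graphics; if they show up in the second list instead, they are annotations.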
-
@JorjMcKie Just a follow-up: several fonts are not being displayed when I write them to the new page using [...]. Then I extract the font using [...]. I then proceed to insert text using page.insert_text(), but the output is not what I expected. The text is not displayed properly; in fact, it only shows empty boxes (see image below): [...] When I inspect the fonts in the new page via [...] I am aware that the page in question is rotated and I am correcting for that using [...]
There is, if you can get hold of the original fonts. This can be cumbersome, because in the general case you must call

1. fitz.TOOLS.set_subset_fontnames(True),
2. page.get_fonts(),
3. doc.extract_font(xref),
4. page.insert_font(fontname="unique-string", fontbuffer=buffer), with the buffer returned by font extraction,
5. page.insert_text(point, text, fontname="unique-string"). So the fontname must be differe…
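The steps above can be sketched as follows. This is only an illustration under stated assumptions: the helper name `transfer_fonts` and the `xf<xref>` naming scheme are hypothetical, and keying the result by the `basefont` entry of `page.get_fonts()` assumes that this name can later be matched against the `font` value of the text spans once subset font names are enabled. As the follow-up in this thread shows, re-using an extracted font does not guarantee that every glyph renders.

```python
def transfer_fonts(src_doc, src_page, dst_page):
    """Copy the extractable fonts of src_page to dst_page.

    src_doc is the PyMuPDF Document owning src_page; dst_page is the target Page.
    Returns a dict mapping each source font name to the unique reference
    name under which it was inserted on dst_page.
    """
    mapping = {}
    for font in src_page.get_fonts():
        xref, basefont = font[0], font[3]
        # extract_font returns (basename, extension, type, content)
        _, ext, _, buffer = src_doc.extract_font(xref)
        if ext == "n/a" or not buffer:
            continue  # font is not embedded or cannot be extracted
        name = f"xf{xref}"  # hypothetical scheme for a unique reference name
        dst_page.insert_font(fontname=name, fontbuffer=buffer)
        mapping[basefont] = name
    return mapping


# Usage sketch (requires PyMuPDF; "input.pdf" is a placeholder):
#   import fitz
#   fitz.TOOLS.set_subset_fontnames(True)   # step 1: keep subset prefixes
#   src = fitz.open("input.pdf")
#   page = src[0]
#   out = fitz.open()
#   new_page = out.new_page(width=page.rect.width, height=page.rect.height)
#   names = transfer_fonts(src, page, new_page)
#   # later: new_page.insert_text(point, text, fontname=names[some_font])
```

Deriving the reference name from the xref is just one simple way to keep names unique within a single source document.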