docs/changes.rst
Lines changed: 17 additions & 0 deletions
Original file line number
Diff line number
Diff line change
@@ -1,6 +1,23 @@
1
1
Change Logs
2
2
===============
3
3
4
+
Changes in Version 1.18.16
5
+
---------------------------
6
+
* **Fixed** issue `#1184 <https://github.com/pymupdf/PyMuPDF/issues/1184>`_. Existing widget fonts in a PDF are now accepted (i.e. no longer forcibly changed to a Base-14 font).
7
+
8
+
* **Fixed** issue `#1154 <https://github.com/pymupdf/PyMuPDF/issues/1154>`_. Text search hits should now be correct when ``clip`` is specified.
* **Added** :attr:`Link.flags` and :meth:`Link.set_flags` to the :ref:`Link` class. Implements enhancement request `#1187 <https://github.com/pymupdf/PyMuPDF/issues/1187>`_.
15
+
16
+
* **Added** option to *simulate* :meth:`TextWriter.fill_textbox` output, to predict the number of lines that a given text would occupy in the textbox.
17
+
18
+
* **Added** text output support as subcommand ``gettext`` to the ``fitz`` CLI module. Most importantly, reproduction of the original **physical text layout** is now supported.
19
+
20
+
4
21
Changes in Version 1.18.15
5
22
---------------------------
6
23
* **Fixed** issue `#1088 <https://github.com/pymupdf/PyMuPDF/issues/1088>`_. Removing an annotation's fill color should now work again both ways, using the ``fill_color=[]`` argument in :meth:`Annot.update` as well as ``fill=[]`` in :meth:`Annot.set_colors`.
docs/document.rst
Lines changed: 2 additions & 0 deletions
Original file line number
Diff line number
Diff line change
@@ -1103,6 +1103,8 @@ For details on **embedded files** refer to Appendix 3.
1103
1103
1104
1104
:arg str user_pw: *(new in version 1.16.0)* set the document's user password.
1105
1105
1106
+
.. note:: The method does not check whether a file of that name already exists. It will therefore overwrite an existing file without asking for confirmation. It is your responsibility as a programmer to handle this.
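
One way to handle this is a small guard around the save call. The following is a minimal sketch, assuming the method in question is ``Document.save`` and that ``doc`` is an already opened document. The helper name ``save_without_overwrite`` is hypothetical and not part of PyMuPDF. Note that another process could still create the file between the check and the save.

```python
import os


def save_without_overwrite(doc, filename, **kwargs):
    """Call doc.save() only if no file named `filename` exists yet.

    `doc` is assumed to offer a save(filename, **kwargs) method, as a
    PyMuPDF Document does. This helper is illustrative, not a PyMuPDF API.
    """
    if os.path.exists(filename):
        raise FileExistsError(f"refusing to overwrite {filename!r}")
    doc.save(filename, **kwargs)


# Possible usage (requires PyMuPDF and an input file):
# import fitz
# doc = fitz.open("input.pdf")
# save_without_overwrite(doc, "output.pdf")
```

Raising an exception keeps the decision with the caller, who may prefer to ask the user or to choose a different file name.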
docs/faq.rst
Lines changed: 32 additions & 0 deletions
Original file line number
Diff line number
Diff line change
@@ -2103,6 +2103,38 @@ If it is *False* or if you want to be on the safe side, pick one of the followin
2103
2103
page.wrap_contents()
2104
2104
>>> # start inserting text, images or annotations here
2105
2105
2106
+
2107
+
Missing or Unreadable Extracted Text
2108
+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2109
+
Several different problems can lead to this.
2110
+
2111
+
Problem: no text
2112
+
^^^^^^^^^^^^^^^^
2113
+
Your PDF viewer does display text, but you cannot select it with your cursor, and text extraction delivers nothing.
2114
+
2115
+
Cause
2116
+
^^^^^^
2117
+
1. You may be looking at an image embedded in the PDF page (e.g. a scanned PDF).
2118
+
2. The PDF creator used no font, but **simulated** text by painting it, using little lines and curves. E.g. a capital "D" could be painted by a line "|" and a left-open semi-circle, an "o" by an ellipse, and so on.
2119
+
2120
+
Solution
2121
+
^^^^^^^^^^
2122
+
Use OCR software like `OCRmyPDF <https://pypi.org/project/ocrmypdf/>`_ to insert a hidden text layer underneath the visible page. The resulting PDF should behave as expected.
2123
+
2124
+
Problem: unreadable text
2125
+
^^^^^^^^^^^^^^^^^^^^^^^^
2126
+
Text extraction does not deliver the text in readable order, duplicates some text, or is otherwise garbled.
2127
+
2128
+
Cause
2129
+
^^^^^^
2130
+
1. The individual characters are readable as such (no "<?>" symbols), but the sequence in which the text is **coded in the file** deviates from the reading order. The motivation may be technical, or it may be to protect the data against unwanted copying.
2131
+
2. Many "<?>" symbols occur, indicating that MuPDF could not interpret these characters. The PDF creator may have used a font that displays readable text but obfuscates the unicode character that leads to the readable symbol (glyph).
2132
+
2133
+
Solution
2134
+
^^^^^^^^
2135
+
1. Use layout-preserving text extraction: ``python -m fitz gettext file.pdf``.
2136
+
2. If other text extraction tools do not work either, then the only solution is, again, to OCR the page.
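
To decide programmatically which of the cases above applies, the extracted text can be inspected. The following is a rough sketch, assuming that the "<?>" symbols correspond to the Unicode replacement character U+FFFD in the extracted string. The function names and the threshold of 0.2 are hypothetical choices for illustration only.

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, assumed to be what shows up as "<?>"


def replacement_ratio(text):
    """Return the share of non-whitespace characters that are U+FFFD."""
    visible = [c for c in text if not c.isspace()]
    if not visible:
        return 0.0
    return sum(c == REPLACEMENT_CHAR for c in visible) / len(visible)


def needs_ocr(text, threshold=0.2):
    """Heuristic: no text at all, or too many unrecognized characters.

    The threshold is an arbitrary illustrative value, not a recommendation.
    """
    return not text.strip() or replacement_ratio(text) > threshold


# Possible usage with PyMuPDF: needs_ocr(page.get_text())
```

A page flagged this way is a candidate for OCR. Text that is merely out of reading order will not be detected by this check.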
docs/functions.rst
Lines changed: 71 additions & 3 deletions
Original file line number
Diff line number
Diff line change
@@ -40,6 +40,7 @@ Yet others are handy, general-purpose utilities.
40
40
:meth:`Page.run` run a page through a device
41
41
:meth:`Page.read_contents` PDF only: get complete, concatenated /Contents source
42
42
:meth:`Page.wrap_contents` wrap contents with stacking commands
43
+
:meth:`Page._get_texttrace()` low-level text information
43
44
:attr:`Page.is_wrapped` check whether contents wrapping is present
44
45
:meth:`planish_line` matrix to map a line to the x-axis
45
46
:meth:`paper_size` return width, height for a known paper format
@@ -396,12 +397,79 @@ Yet others are handy, general-purpose utilities.
396
397
397
398
-----
398
399
399
-
.. method:: Page.wrap_contents
400
+
.. method:: Page.wrap_contents()
400
401
401
402
Put the string pair "q" / "Q" before and after, respectively, a page's */Contents* object(s) to ensure that any "geometry" changes are **local** only.
402
403
403
404
Use this method as an alternative, minimalistic version of :meth:`Page.clean_contents`. Its advantage is a small footprint in terms of processing time and impact on the data size of incremental saves.
404
405
406
+
-----
407
+
408
+
.. method:: Page._get_texttrace()
409
+
410
+
*New in v1.18.16*
411
+
412
+
Return low-level text information of the page (**all** document types). This is a list of Python dictionaries with the following content::

    {
        'ascender': 0.75,                     # font ascender (1)
        'bidi': 0,                            # bidirectional level (1)
        'chars': (                            # char information, tuple[tuple]
            (32,                              # unicode (4)
             3,                               # glyph id (font dependent)
             (470.3800354003906,              # origin.x (1)
              755.3758544921875),             # origin.y (1)
             2.495859366375953                # width (points)
            ),
        ),
        'color': (0.0,),                      # text color, tuple[float] (1)
        'colorspace': 1,                      # number of colorspace components (1)
        'descender': -0.25,                   # font descender (1)
        'dir': (1.0, 0.0),                    # writing direction (1)
        'flags': 4,                           # font flags (1)
        'font': 'Calibri',                    # font name (1)
        'linewidth': 0.5519999980926514,      # last known line width value (3)
        'opacity': 1.0,                       # alpha value of the text (5)
        'scissor': (1.0, 1.0, -1.0, -1.0),    # <ignore>
        'size': 11.039999961853027,           # font size (1)
        'spacewidth': 2.495859366375953,      # width of space character (synthesized)
        'type': 0,                            # span type (2)
        'wmode': 0                            # writing mode (1)
    }
439
+
440
+
Details:
441
+
442
+
1. Same meaning as explained in :ref:`TextPage`.
443
+
2. There are 5 text span types:
444
+
445
+
0. Filled text -- equivalent to PDF text rendering mode 0 (``0 Tr``), only the characters' inside is shown.
446
+
1. Stroked text -- equivalent to ``1 Tr``, only the character borders are shown.
447
+
2. Clipped text -- details yet unknown.
448
+
3. Clip-stroked text -- details yet unknown.
449
+
4. Ignored text -- equivalent to ``3 Tr``.
450
+
451
+
3. Line width in this context matters only when ``span["type"] != 0``: it determines the thickness of stroked lines. If the data provide no value, ``span["size"] * 0.05`` is generated. PDFs often create "artificial" bold text with ``2 Tr``. There is no equivalent text span type for this case. Instead, such text is represented by two consecutive spans that are identical in every respect, except that one has ``span["type"] = 0`` and the other ``span["type"] = 1``.
452
+
4. For data compactness, the character's unicode is provided here. Use function ``chr()`` for the character itself.
453
+
5. The alpha / opacity value of the span's text, 0 <= opacity <= 1. Zero means invisible text, 1 (100%) fully covers what is behind.
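
The notes on unicode values and invisibility can be illustrated with two small helpers. This is a minimal sketch that operates on a span dictionary of the shape shown above. The sample values and the helper names ``span_text`` and ``is_invisible`` are made up for illustration and are not part of PyMuPDF.

```python
# Sample span in the format shown above (values are illustrative only).
SAMPLE_SPAN = {
    "chars": (
        (72, 43, (100.0, 700.0), 7.0),   # unicode 72 = "H"
        (105, 76, (107.0, 700.0), 2.5),  # unicode 105 = "i"
    ),
    "opacity": 1.0,
    "type": 0,
}


def span_text(span):
    """Join the unicode values of all character items into a string."""
    return "".join(chr(char[0]) for char in span["chars"])


def is_invisible(span):
    """True for fully transparent text (opacity 0) or ignored text (type 4)."""
    return span["opacity"] == 0 or span["type"] == 4


print(span_text(SAMPLE_SPAN), is_invisible(SAMPLE_SPAN))
```

With PyMuPDF available, the same helpers could be applied to every item of ``page._get_texttrace()``.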
454
+
455
+
Here is a list of similarities and differences of ``page._get_texttrace()`` compared to ``page.get_text("rawdict")``:
456
+
457
+
* The method is up to **twice as fast.**
458
+
* The returned information is very **much smaller in size.**
459
+
* Additional types of text **invisibility can be detected**: opacity = 0 and type = 4.
460
+
* Character bboxes are not provided; if needed, compute them from available information.
461
+
* If MuPDF returns unicode 0xFFFD (65533) for unrecognized characters, you may still be able to deduce the required information from the glyph id.
462
+
* ``span["chars"]`` **contains no spaces** **unless** the document creator has coded them. They **will never be generated**, as happens in the :meth:`Page.get_text` methods. To help with your own computations, the width of a space character is given. This value is derived from the font where possible; otherwise a synthetic value is used.
463
+
* There is no effort to organize text as happens for a :ref:`TextPage` (the hierarchy of blocks, lines, spans, and characters). Characters are simply extracted in sequence, one by one, and put in a span. Whenever any of the span's characteristics change, a new span is started. So you may find characters with different ``origin.y`` values in the same span. You cannot assume that span characters are sorted in any particular order -- you must make sense of the information yourself, taking ``span["dir"]``, ``span["wmode"]``, etc. into account.
464
+
* Ligatures are represented like this:
465
+
- MuPDF handles these ligatures: "fi", "ff", "fl", "ft", "st", "ffi", and "ffl". If the page contains e.g. the ligature "fi", you will find the following two character items in direct succession::

    (102, glyph, (x, y), width)  # 102 = ord("f")
    (105, -1, (x, y), 0)         # 105 = ord("i")

470
+
- This means that the ligature's component characters are shown combined within the space given by width. How you handle these cases in your text extraction is up to you. This is similar to ``page.get_text("rawdict")``: a glyph id is never available there, but you can assume a ligature if you encounter one of the character combinations above with the **same origin** and ``bbox.width = 0`` for all but the first character.
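
Several of the points above (missing character bboxes, missing spaces, ligatures) can be addressed with a little arithmetic on the span data. The following is a heuristic sketch only. It assumes horizontal left-to-right text (``dir == (1, 0)``, ``wmode == 0``), a y-axis pointing downwards, and that ascender and descender are fractions of the font size. The names ``char_bbox`` and ``span_to_text`` and the parameters ``gap_factor`` and ``y_tol`` are hypothetical.

```python
def char_bbox(span, char):
    """Approximate (x0, y0, x1, y1) of one character item.

    Assumption: the box spans from the ascender line to the descender
    line of the font, scaled by the font size.
    """
    _code, _glyph, (x, y), width = char
    size = span["size"]
    top = y - span["ascender"] * size
    bottom = y - span["descender"] * size
    return (x, top, x + width, bottom)


def span_to_text(span, gap_factor=0.5, y_tol=1.0):
    """Rebuild text from a span, inserting spaces and line breaks.

    A space is inserted when the horizontal gap to the previous character
    exceeds gap_factor * spacewidth, a line break when origin.y changes
    by more than y_tol. Ligature components (width 0, same origin) do not
    move the running end position, so no space is inserted after them.
    """
    parts = []
    prev_end = prev_y = None
    for code, _glyph, (x, y), width in span["chars"]:
        same_line = prev_y is not None and abs(y - prev_y) <= y_tol
        if prev_y is not None and not same_line:
            parts.append("\n")
        elif same_line and x - prev_end > gap_factor * span["spacewidth"]:
            parts.append(" ")
        parts.append(chr(code))
        prev_end = max(prev_end, x + width) if same_line else x + width
        prev_y = y
    return "".join(parts)


# Illustrative span: ligature "fi", then "t", a gap, "o", and "k" on a new line.
DEMO_SPAN = {
    "size": 10.0, "ascender": 0.75, "descender": -0.25, "spacewidth": 2.5,
    "chars": (
        (102, 7, (100.0, 700.0), 5.0),   # "f", carries the ligature width
        (105, -1, (100.0, 700.0), 0.0),  # "i", ligature component
        (116, 9, (105.0, 700.0), 3.0),   # "t"
        (111, 8, (111.0, 700.0), 5.0),   # "o", 3 points after the "t"
        (107, 6, (100.0, 712.0), 5.0),   # "k", different origin.y
    ),
}
```

Because characters are not guaranteed to be sorted, real documents may require sorting the character items first and taking ``span["dir"]`` into account.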
471
+
472
+
405
473
-----
406
474
407
475
.. attribute:: Page.is_wrapped
@@ -412,13 +480,13 @@ Yet others are handy, general-purpose utilities.
412
480
413
481
.. method:: Page.get_text_blocks(flags=None)
414
482
415
-
Deprecated wrapper for :meth:`TextPage.extractBLOCKS`. Use :meth:`Page.getText` with the "blocks" option instead.
483
+
Deprecated wrapper for :meth:`TextPage.extractBLOCKS`. Use :meth:`Page.get_text` with the "blocks" option instead.
416
484
417
485
-----
418
486
419
487
.. method:: Page.get_text_words(flags=None)
420
488
421
-
Deprecated wrapper for :meth:`TextPage.extractWORDS`. Use :meth:`Page.getText` with the "words" option instead.
489
+
Deprecated wrapper for :meth:`TextPage.extractWORDS`. Use :meth:`Page.get_text` with the "words" option instead.
:attr:`Link.rect` clickable area in untransformed coordinates.
21
23
:attr:`Link.uri` link destination
@@ -40,7 +42,7 @@ There is a parent-child relationship between a link and its page. If the page ob
40
42
41
43
.. method:: set_colors(colors=None, stroke=None)
42
44
43
-
Changes the "stroke" color.
45
+
PDF only: Changes the "stroke" color.
44
46
45
47
.. note:: In PDF, links are a subtype of annotations technically and **do not support fill colors**. However, to keep a consistent API, we do allow specifying a ``fill=`` parameter like with all annotations, which will be ignored with a warning.
46
48
@@ -49,6 +51,19 @@ There is a parent-child relationship between a link and its page. If the page ob
49
51
:arg dict colors: a dictionary containing color specifications. For accepted dictionary keys and values see below. The most practical way should be to first make a copy of the *colors* property and then modify this dictionary as required.
50
52
:arg sequence stroke: see above.
51
53
54
+
.. method:: set_flags(flags)
55
+
56
+
*New in v1.18.16*
57
+
58
+
Set the PDF ``/F`` property of the link annotation. See :meth:`Annot.set_flags` for details. If not a PDF, this method is a no-op.
59
+
60
+
61
+
.. attribute:: flags
62
+
63
+
*New in v1.18.16*
64
+
65
+
Return the link annotation flags, an integer (see :attr:`Annot.flags` for details). Zero if not a PDF.
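
As an illustration of how the two additions work together, the following sketch walks all links of a page and sets one flag bit where it is missing. It relies only on ``Page.first_link``, ``Link.next``, ``Link.flags`` and ``Link.set_flags``. The helper names are hypothetical, and the value 4 for the "Print" flag is taken from the PDF specification of the annotation ``/F`` entry rather than from a PyMuPDF constant.

```python
PRINT_FLAG = 4  # bit 3 ("Print") of the annotation /F entry per the PDF specification


def iter_links(page):
    """Yield the Link objects of a page by following Link.next."""
    link = page.first_link
    while link is not None:
        yield link
        link = link.next


def make_links_printable(page):
    """Set the Print bit on every link that lacks it; return the change count."""
    changed = 0
    for link in iter_links(page):
        if not link.flags & PRINT_FLAG:
            link.set_flags(link.flags | PRINT_FLAG)
            changed += 1
    return changed
```

Combining the new value with the existing flags via ``|`` preserves any other bits that are already set.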