Commit be074d5

upload v1.18.16
1 parent d1c6e30 commit be074d5

10 files changed

+228
-26
lines changed

docs/changes.rst

Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,23 @@
11
Change Logs
22
===============
33

4+
Changes in Version 1.18.16
5+
---------------------------
6+
* **Fixed** issue `#1184 <https://github.com/pymupdf/PyMuPDF/issues/1184>`_. Existing fonts of PDF widgets are now accepted (i.e. no longer forcibly changed to a Base-14 font).
7+
8+
* **Fixed** issue `#1154 <https://github.com/pymupdf/PyMuPDF/issues/1154>`_. Text search hits should now be correct when ``clip`` is specified.
9+
10+
* **Fixed** issue `#1152 <https://github.com/pymupdf/PyMuPDF/issues/1152>`_.
11+
12+
* **Fixed** issue `#1146 <https://github.com/pymupdf/PyMuPDF/issues/1146>`_.
13+
14+
* **Added** :attr:`Link.flags` and :meth:`Link.set_flags` to the :ref:`Link` class. Implements enhancement request `#1187 <https://github.com/pymupdf/PyMuPDF/issues/1187>`_.
15+
16+
* **Added** option to *simulate* :meth:`TextWriter.fill_textbox` output for predicting the number of lines that a given text would occupy in the textbox.
17+
18+
* **Added** text output support as subcommand ``gettext`` to the ``fitz`` CLI module. Most importantly, reproduction of the original **physical text layout** is now supported.
19+
20+
421
Changes in Version 1.18.15
522
---------------------------
623
* **Fixed** issue `#1088 <https://github.com/pymupdf/PyMuPDF/issues/1088>`_. Removing an annotation's fill color should now work again both ways, using the ``fill_color=[]`` argument in :meth:`Annot.update` as well as ``fill=[]`` in :meth:`Annot.set_colors`.

docs/conf.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -43,7 +43,7 @@
4343
# built documents.
4444
#
4545
# The full version, including alpha/beta/rc tags.
46-
release = "1.18.15"
46+
release = "1.18.16"
4747

4848
# The short X.Y version
4949
version = release

docs/document.rst

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1103,6 +1103,8 @@ For details on **embedded files** refer to Appendix 3.
11031103

11041104
:arg str user_pw: *(new in version 1.16.0)* set the document's user password.
11051105

1106+
.. note:: The method does not check whether a file of that name already exists. It will therefore not ask for confirmation and will overwrite the file. It is your responsibility as a programmer to handle this.
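
A minimal sketch of one way to handle this, assuming you prefer an exception over a silent overwrite. The helper name ``checked_path`` is hypothetical and not part of PyMuPDF; only the standard library is used, and the call on a document object is shown as a comment.

```python
import os


def checked_path(filename, overwrite=False):
    """Return filename unchanged, or raise if it exists and overwrite is False.

    Hypothetical helper for illustration, not part of PyMuPDF.
    """
    if not overwrite and os.path.exists(filename):
        raise FileExistsError(f"refusing to overwrite {filename!r}")
    return filename


# Intended use with an open document `doc` (not executed here):
#     doc.save(checked_path("output.pdf"))
```

Note that the check and the save are two separate steps, so another process could still create the file in between.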
1107+
11061108
.. method:: ez_save(*args, **kwargs)
11071109

11081110
*(New in v1.18.11)*

docs/faq.rst

Lines changed: 32 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2103,6 +2103,38 @@ If it is *False* or if you want to be on the safe side, pick one of the followin
21032103
page.wrap_contents()
21042104
>>> # start inserting text, images or annotations here
21052105

2106+
2107+
Missing or Unreadable Extracted Text
2108+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2109+
This symptom can have a number of different causes.
2110+
2111+
Problem: no text
2112+
^^^^^^^^^^^^^^^^
2113+
Your PDF viewer does display text, but you cannot select it with your cursor, and text extraction delivers nothing.
2114+
2115+
Cause
2116+
^^^^^^
2117+
1. You may be looking at an image embedded in the PDF page (e.g. a scanned PDF).
2118+
2. The PDF creator used no font, but **simulated** text by painting it, using little lines and curves. E.g. a capital "D" could be painted by a line "|" and a left-open semi-circle, an "o" by an ellipse, and so on.
2119+
2120+
Solution
2121+
^^^^^^^^^^
2122+
Use OCR software such as `OCRmyPDF <https://pypi.org/project/ocrmypdf/>`_ to insert a hidden text layer underneath the visible page. The resulting PDF should behave as expected.
2123+
2124+
Problem: unreadable text
2125+
^^^^^^^^^^^^^^^^^^^^^^^^
2126+
Text extraction does not deliver the text in readable order, duplicates some text, or is otherwise garbled.
2127+
2128+
Cause
2129+
^^^^^^
2130+
1. The individual characters are readable as such (no "<?>" symbols), but the sequence in which the text is **coded in the file** deviates from the reading order. The motivation behind this may be technical, or it may be to protect data against unwanted copying.
2131+
2. Many "<?>" symbols occur, indicating that MuPDF could not interpret these characters. The PDF creator may have used a font that displays readable text, but obfuscates the unicode character that leads to the readable symbol (glyph).
2132+
2133+
Solution
2134+
^^^^^^^^
2135+
1. Use layout-preserving text extraction: ``python -m fitz gettext file.pdf``.
2136+
2. If other text extraction tools fail as well, then the only remaining solution is, again, to OCR the page.
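
To tell these situations apart programmatically, one rough approach is to count how often the Unicode replacement character U+FFFD occurs in the extracted text. This assumes that the "<?>" symbols correspond to U+FFFD; the function name ``needs_ocr`` and the 10% threshold are illustrative choices, not part of PyMuPDF. The input is whatever string your text extraction returned.

```python
REPLACEMENT_CHAR = "\ufffd"  # assumed stand-in for characters that could not be interpreted


def needs_ocr(text, threshold=0.1):
    """Guess whether extracted text is empty or dominated by unrecognized characters."""
    visible = [c for c in text if not c.isspace()]
    if not visible:
        # Nothing extracted: possibly an image-only page or painted text.
        return True
    bad = sum(1 for c in visible if c == REPLACEMENT_CHAR)
    return bad / len(visible) > threshold
```

A page flagged this way is a candidate for OCR; text that is readable but in the wrong order will not be flagged and is better served by the layout-preserving extraction.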
2137+
21062138
--------------------------
21072139

21082140
Low-Level Interfaces

docs/functions.rst

Lines changed: 71 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -40,6 +40,7 @@ Yet others are handy, general-purpose utilities.
4040
:meth:`Page.run` run a page through a device
4141
:meth:`Page.read_contents` PDF only: get complete, concatenated /Contents source
4242
:meth:`Page.wrap_contents` wrap contents with stacking commands
43+
:meth:`Page._get_texttrace()` low-level text information
4344
:attr:`Page.is_wrapped` check whether contents wrapping is present
4445
:meth:`planish_line` matrix to map a line to the x-axis
4546
:meth:`paper_size` return width, height for a known paper format
@@ -396,12 +397,79 @@ Yet others are handy, general-purpose utilities.
396397

397398
-----
398399

399-
.. method:: Page.wrap_contents
400+
.. method:: Page.wrap_contents()
400401

401402
Put string pair "q" / "Q" before, resp. after a page's */Contents* object(s) to ensure that any "geometry" changes are **local** only.
402403

403404
Use this method as an alternative, minimalistic version of :meth:`Page.clean_contents`. Its advantage is a small footprint in terms of processing time and impact on the data size of incremental saves.
404405

406+
-----
407+
408+
.. method:: Page._get_texttrace()
409+
410+
*New in v1.18.16*
411+
412+
Return low-level text information of the page (**all** document types). This is a list of Python dictionaries with the following content::
413+
414+
{
415+
'ascender': 0.75, # font ascender (1)
416+
'bidi': 0, # bidirectional level (1)
417+
'chars': ( # char information, tuple[tuple]
418+
(32, # unicode (4)
419+
3, # glyph id (font dependent)
420+
(470.3800354003906, # origin.x (1)
421+
755.3758544921875), # origin.y (1)
422+
2.495859366375953 # width (points)
423+
),
424+
),
425+
'color': (0.0,), # text color, tuple[float] (1)
426+
'colorspace': 1, # number of colorspace components (1)
427+
'descender': -0.25, # font descender (1)
428+
'dir': (1.0, 0.0), # writing direction (1)
429+
'flags': 4, # font flags (1)
430+
'font': 'Calibri', # font name (1)
431+
'linewidth': 0.5519999980926514, # last known line width value (3)
432+
'opacity': 1.0, # alpha value of the text (5)
433+
'scissor': (1.0, 1.0, -1.0, -1.0), # <ignore>
434+
'size': 11.039999961853027, # font size (1)
435+
'spacewidth': 2.495859366375953, # width of space character (synthesized)
436+
'type': 0, # span type (2)
437+
'wmode': 0 # writing mode (1)
438+
}
439+
440+
Details:
441+
442+
1. Same meaning as explained in :ref:`TextPage`.
443+
2. There are 5 text span types:
444+
445+
0. Filled text -- equivalent to PDF text rendering mode 0 (``0 Tr``), only the characters' inside is shown.
446+
1. Stroked text -- equivalent to ``1 Tr``, only the character borders are shown.
447+
2. Clipped text -- details yet unknown.
448+
3. Clip-stroked text -- details yet unknown.
449+
4. Ignored text -- equivalent to ``3 Tr``.
450+
451+
3. Line width in this context is important only for processing ``span["type"] != 0``: it determines the thickness of stroked lines. This value may not be provided in the data. In this case, a value of ``span["size"] * 0.05`` is generated. Often, an "artificial" bold text in PDF is created by ``2 Tr``. There is no equivalent text span type for this case. Instead, the respective text is represented by two consecutive spans -- which are identical in every aspect, except one is ``span["type"] = 0`` and the other one ``span["type"] = 1``.
452+
4. For data compactness, the character's unicode is provided here. Use function ``chr()`` for the character itself.
453+
5. The alpha / opacity value of the span's text, 0 <= opacity <= 1. Zero is invisible text, 1 (100%) covers what is behind.
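
A small sketch of how a span dictionary of this shape might be turned into readable values, following details 2 and 4. The function name and the type labels are local to this example and are not defined by PyMuPDF; the input is assumed to be one dictionary from the returned list.

```python
# Labels for the five span types listed above (names are local to this sketch)
SPAN_TYPE_NAMES = {0: "filled", 1: "stroked", 2: "clipped", 3: "clip-stroked", 4: "ignored"}


def span_summary(span):
    """Return (text, type label) for one span dictionary."""
    # Each char item is (unicode, glyph id, origin, width); chr() yields the character.
    text = "".join(chr(char[0]) for char in span["chars"])
    return text, SPAN_TYPE_NAMES.get(span["type"], "unknown")
```

Because the characters are joined in the order in which they appear in the span, the result reflects the file's coding order, which need not be the reading order.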
454+
455+
Here is a list of similarities and differences of ``page._get_texttrace()`` compared to ``page.get_text("rawdict")``:
456+
457+
* The method is up to **twice as fast.**
458+
* The returned information is very **much smaller in size.**
459+
* Additional types of text **invisibility can be detected**: opacity = 0 and type = 4.
460+
* Character bboxes are not provided; if needed, compute them from available information.
461+
* If MuPDF returns unicode 0xFFFD (65533) for unrecognized characters, you may still be able to deduce the required information from the glyph id.
462+
* The ``span["chars"]`` **contains no spaces**, **unless** the document creator has coded them. They **will never be generated** as happens in the :meth:`Page.get_text` methods. To provide some help for doing your own computations here, the width of a space character is given. This value is derived from the font where possible. Otherwise a synthetic value is taken.
463+
* There is no effort to organize text as happens for a :ref:`TextPage` (the hierarchy of blocks, lines, spans, and characters). Characters are simply extracted in sequence, one by one, and put in a span. Whenever any of the span's characteristics change, a new span is started. So you may find characters with different ``origin.y`` values in the same span. You cannot assume that span characters are sorted in any particular order -- you must make sense of the info yourself, taking ``span["dir"]``, ``span["wmode"]``, etc. into account.
464+
* Ligatures are represented like this:
465+
- MuPDF handles these ligatures: "fi", "ff", "fl", "ft", "st", "ffi", and "ffl". If the page contains e.g. ligature "fi", you will find the following two character items subsequent to each other::
466+
467+
(102, glyph, (x, y), width) # 102 = ord("f")
468+
(105, -1, (x, y), 0) # 105 = ord("i")
469+
470+
- This means that the ligature character components are shown combined within the space given by width. It is up to you how to handle these cases in your text extraction. This is similar to ``page.get_text("rawdict")``: a glyph id is never available there, but you can assume a ligature if you encounter one of the character combinations above, having the **same origin** and ``bbox.width = 0`` except for the first character.
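
The invisibility and ligature rules from the list above can be sketched as follows. Both function names are hypothetical helpers, not PyMuPDF API, and they rely only on the dictionary layout shown earlier: ``opacity`` and ``type`` on the span, and ``(unicode, glyph, origin, width)`` tuples in ``span["chars"]``.

```python
def is_invisible(span):
    """True for the two invisibility cases named above: opacity 0 or span type 4."""
    return span["opacity"] == 0 or span["type"] == 4


def ligature_components(span):
    """Indices of characters that continue a ligature (glyph id -1 and width 0)."""
    return [
        i
        for i, (_, glyph, _, width) in enumerate(span["chars"])
        if glyph == -1 and width == 0
    ]
```

An index returned by ``ligature_components`` marks a follow-up character such as the "i" of "fi"; the character just before it carries the glyph id and the width of the whole ligature.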
471+
472+
405473
-----
406474

407475
.. attribute:: Page.is_wrapped
@@ -412,13 +480,13 @@ Yet others are handy, general-purpose utilities.
412480

413481
.. method:: Page.get_text_blocks(flags=None)
414482

415-
Deprecated wrapper for :meth:`TextPage.extractBLOCKS`. Use :meth:`Page.getText` with the "blocks" option instead.
483+
Deprecated wrapper for :meth:`TextPage.extractBLOCKS`. Use :meth:`Page.get_text` with the "blocks" option instead.
416484

417485
-----
418486

419487
.. method:: Page.get_text_words(flags=None)
420488

421-
Deprecated wrapper for :meth:`TextPage.extractWORDS`. Use :meth:`Page.getText` with the "words" option instead.
489+
Deprecated wrapper for :meth:`TextPage.extractWORDS`. Use :meth:`Page.get_text` with the "words" option instead.
422490

423491
-----
424492

docs/images/img-layout-text.jpg

378 KB

docs/link.rst

Lines changed: 16 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -12,10 +12,12 @@ There is a parent-child relationship between a link and its page. If the page ob
1212
========================= ============================================
1313
:meth:`Link.set_border` modify border properties
1414
:meth:`Link.set_colors` modify color properties
15+
:meth:`Link.set_flags` modify link flags
1516
:attr:`Link.border` border characteristics
1617
:attr:`Link.colors` border line color
1718
:attr:`Link.dest` points to destination details
1819
:attr:`Link.is_external` external destination?
20+
:attr:`Link.flags` link annotation flags
1921
:attr:`Link.next` points to next link
2022
:attr:`Link.rect` clickable area in untransformed coordinates.
2123
:attr:`Link.uri` link destination
@@ -40,7 +42,7 @@ There is a parent-child relationship between a link and its page. If the page ob
4042

4143
.. method:: set_colors(colors=None, stroke=None)
4244

43-
Changes the "stroke" color.
45+
PDF only: Changes the "stroke" color.
4446

4547
.. note:: In PDF, links are a subtype of annotations technically and **do not support fill colors**. However, to keep a consistent API, we do allow specifying a ``fill=`` parameter like with all annotations, which will be ignored with a warning.
4648

@@ -49,6 +51,19 @@ There is a parent-child relationship between a link and its page. If the page ob
4951
:arg dict colors: a dictionary containing color specifications. For accepted dictionary keys and values see below. The most practical way should be to first make a copy of the *colors* property and then modify this dictionary as required.
5052
:arg sequence stroke: see above.
5153

54+
.. method:: set_flags(flags)
55+
56+
*New in v1.18.16*
57+
58+
Set the PDF ``/F`` property of the link annotation. See :meth:`Annot.set_flags` for details. If not a PDF, this method is a no-op.
59+
60+
61+
.. attribute:: flags
62+
63+
*New in v1.18.16*
64+
65+
Return the link annotation flags, an integer (see :attr:`Annot.flags` for details). Zero if not a PDF.
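
The ``/F`` value is a bit field, so flags are usually changed by setting or clearing single bits. The sketch below uses bit values from the annotation flags table of the PDF specification; the constant names and the helper ``with_flag`` are local to this example and are not PyMuPDF identifiers. Only the bit arithmetic is executed, and the call on a real link object appears as a comment.

```python
# Annotation flag bits per the PDF specification (names are local to this sketch)
FLAG_HIDDEN = 1 << 1      # 2
FLAG_PRINT = 1 << 2       # 4
FLAG_READ_ONLY = 1 << 6   # 64


def with_flag(flags, bit, on=True):
    """Return flags with the given bit set (on=True) or cleared (on=False)."""
    return flags | bit if on else flags & ~bit


# Intended use with a link object `link` of a PDF page (not executed here):
#     link.set_flags(with_flag(link.flags, FLAG_PRINT))
```

Reading the current value first and modifying only the desired bit keeps any flags that were already set on the link.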
66+
5267

5368
.. attribute:: colors
5469
