On **[PyPI](https://pypi.org/project/PyMuPDF)** since August 2016: [](https://pepy.tech/project/pymupdf)
6
8
7
-
On **[PyPI](https://pypi.org/project/PyMuPDF)** since August 2016: [](https://pepy.tech/project/pymupdf)
PyMuPDF (current version 1.19.2) is a Python binding with support for [MuPDF](https://mupdf.com/) (current version 1.19.*), a lightweight PDF, XPS, and E-book viewer, renderer, and toolkit, which is maintained and developed by Artifex Software, Inc.
14
+
PyMuPDF (current version 1.19.3) is a Python binding with support for [MuPDF](https://mupdf.com/) (current version 1.19.*), a lightweight PDF, XPS, and E-book viewer, renderer, and toolkit, which is maintained and developed by Artifex Software, Inc.
14
15
15
16
MuPDF can access files in PDF, XPS, OpenXPS, CBZ, EPUB and FB2 (e-books) formats, and it is known for its top performance and high rendering quality.
16
17
@@ -59,7 +60,11 @@ Have a look at the basic [demos](https://github.com/pymupdf/PyMuPDF-Utilities/tr
59
60
Documentation is written using Sphinx and is available in various formats from the following sources. It currently is a combination of reference guide and user manual. For a **quick start** look at the [tutorial](https://pymupdf.readthedocs.io/en/latest/tutorial.html) and the [recipes](https://pymupdf.readthedocs.io/en/latest/faq.html) chapters.
60
61
61
62
* You can view it online at [Read the Docs](https://readthedocs.org/projects/pymupdf/). This site also provides download options for PDF.
66
+
* The search function on Read the Docs does not work for me currently. If you want a working searchable local version, please download a zipped HTML from [here](https://github.com/pymupdf/PyMuPDF-optional-material/tree/master/doc/pymupdf.zip).
63
68
* Find a Windows help file [here](https://github.com/pymupdf/PyMuPDF-optional-material/tree/master/doc/PyMuPDF.chm).
64
69
65
70
The latest changelog can be viewed [here](https://pymupdf.readthedocs.io/en/latest/changes.html).
changes.rst
+17 lines changed: 17 additions & 0 deletions
@@ -3,6 +3,23 @@ Change Log
3
3
4
4
------
5
5
6
+
**Changes in Version 1.19.3**
7
+
8
+
This patch version implements minor improvements for :ref:`Pixmap` and also some important fixes.
9
+
10
+
* **Fixed** `#1351 <https://github.com/pymupdf/PyMuPDF/discussions/1351>`_. Reverted code that introduced the memory growth in v1.18.15.
11
+
* **Fixed** `#1417 <https://github.com/pymupdf/PyMuPDF/discussions/1417>`_. Developed a workaround for the growth of open file handles when using :meth:`Document.insert_pdf`.
12
+
* **Fixed** `#1418 <https://github.com/pymupdf/PyMuPDF/discussions/1418>`_. Developed a workaround for memory growth when using :meth:`Document.insert_pdf`.
13
+
* **Fixed** `#1430 <https://github.com/pymupdf/PyMuPDF/discussions/1430>`_. Developed a workaround for mass pixmap generation of document pages.
14
+
* **Fixed** `#1433 <https://github.com/pymupdf/PyMuPDF/discussions/1433>`_. Solves a bbox error for some Type 3 fonts in PyMuPDF text processing.
15
+
* **Added** :meth:`Pixmap.color_topusage` to determine the share of the most frequently used color. Solves `#1397 <https://github.com/pymupdf/PyMuPDF/discussions/1397>`_.
16
+
* **Added** :meth:`Pixmap.warp` which makes a new pixmap from a given arbitrary convex quad inside the pixmap.
17
+
* **Added** :meth:`Rect.torect` and :meth:`IRect.torect` which compute a matrix that transforms to a given other rectangle.
18
+
* **Changed** :meth:`Pixmap.color_count` to also return the count of each color.
19
+
* **Changed** :meth:`Page.get_texttrace` to also return correct span and character bboxes if ``span["dir"] != (1, 0)``.
20
+
21
+
------
22
+
6
23
**Changes in Version 1.19.2**
7
24
8
25
This patch version implements minor improvements for :meth:`Page.get_drawings` and also some important fixes.
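The 1.19.3 entry for :meth:`Rect.torect` describes "a matrix that transforms to a given other rectangle". The following is a minimal pure-Python sketch of that idea, not PyMuPDF's implementation. The function names `rect_to_rect_matrix` and `apply_matrix` are hypothetical, rectangles are assumed to be plain `(x0, y0, x1, y1)` tuples, and the matrix is assumed to follow the six-value `(a, b, c, d, e, f)` layout with `x' = a*x + c*y + e` and `y' = b*x + d*y + f`.

```python
def rect_to_rect_matrix(src, dst):
    """Return (a, b, c, d, e, f) mapping rectangle src onto rectangle dst.

    Hypothetical helper: both rectangles are (x0, y0, x1, y1) tuples.
    Only scaling and translation are needed, so b and c stay zero.
    """
    sx0, sy0, sx1, sy1 = src
    dx0, dy0, dx1, dy1 = dst
    src_width = sx1 - sx0
    src_height = sy1 - sy0
    if src_width == 0 or src_height == 0:
        raise ValueError("source rectangle must have non-zero width and height")
    a = (dx1 - dx0) / src_width    # horizontal scale factor
    d = (dy1 - dy0) / src_height   # vertical scale factor
    e = dx0 - sx0 * a              # horizontal shift after scaling
    f = dy0 - sy0 * d              # vertical shift after scaling
    return (a, 0.0, 0.0, d, e, f)


def apply_matrix(point, matrix):
    """Transform an (x, y) point with a six-value matrix."""
    x, y = point
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)


if __name__ == "__main__":
    m = rect_to_rect_matrix((0, 0, 100, 200), (10, 20, 60, 70))
    # The corners of the source rectangle land on the corners of the target.
    print(apply_matrix((0, 0), m), apply_matrix((100, 200), m))
```

Keeping the result as a plain tuple makes the arithmetic easy to check by hand; in real code the PyMuPDF method would be the natural choice.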
docs/changes.rst
+17 lines changed: 17 additions & 0 deletions
@@ -3,6 +3,23 @@ Change Log
3
3
4
4
------
5
5
6
+
**Changes in Version 1.19.3**
7
+
8
+
This patch version implements minor improvements for :ref:`Pixmap` and also some important fixes.
9
+
10
+
* **Fixed** `#1351 <https://github.com/pymupdf/PyMuPDF/discussions/1351>`_. Reverted code that introduced the memory growth in v1.18.15.
11
+
* **Fixed** `#1417 <https://github.com/pymupdf/PyMuPDF/discussions/1417>`_. Developed a workaround for the growth of open file handles when using :meth:`Document.insert_pdf`.
12
+
* **Fixed** `#1418 <https://github.com/pymupdf/PyMuPDF/discussions/1418>`_. Developed a workaround for memory growth when using :meth:`Document.insert_pdf`.
13
+
* **Fixed** `#1430 <https://github.com/pymupdf/PyMuPDF/discussions/1430>`_. Developed a workaround for mass pixmap generation of document pages.
14
+
* **Fixed** `#1433 <https://github.com/pymupdf/PyMuPDF/discussions/1433>`_. Solves a bbox error for some Type 3 fonts in PyMuPDF text processing.
15
+
* **Added** :meth:`Pixmap.color_topusage` to determine the share of the most frequently used color. Solves `#1397 <https://github.com/pymupdf/PyMuPDF/discussions/1397>`_.
16
+
* **Added** :meth:`Pixmap.warp` which makes a new pixmap from a given arbitrary convex quad inside the pixmap.
17
+
* **Added** :meth:`Rect.torect` and :meth:`IRect.torect` which compute a matrix that transforms to a given other rectangle.
18
+
* **Changed** :meth:`Pixmap.color_count` to also return the count of each color.
19
+
* **Changed** :meth:`Page.get_texttrace` to also return correct span and character bboxes if ``span["dir"] != (1, 0)``.
20
+
21
+
------
22
+
6
23
**Changes in Version 1.19.2**
7
24
8
25
This patch version implements minor improvements for :meth:`Page.get_drawings` and also some important fixes.
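The 1.19.3 entries for :meth:`Pixmap.color_count` ("also return the count of each color") and :meth:`Pixmap.color_topusage` ("the share of the most frequently used color") can be illustrated without a pixmap. This is a sketch under stated assumptions: `count_colors` and `top_color_share` are hypothetical names, pixels are assumed to be a flat list of RGB tuples, and the `(share, color)` return shape is chosen for illustration only; the change log does not state what the real methods return.

```python
from collections import Counter


def count_colors(pixels):
    """Map each distinct color to the number of pixels that use it."""
    return Counter(pixels)


def top_color_share(pixels):
    """Return (share, color) for the most frequently used color.

    'share' is the fraction of all pixels, between 0 and 1.
    """
    if not pixels:
        raise ValueError("need at least one pixel")
    color, count = Counter(pixels).most_common(1)[0]
    return count / len(pixels), color


if __name__ == "__main__":
    white, black = (255, 255, 255), (0, 0, 0)
    sample = [white, white, white, black]  # hypothetical 4-pixel image
    print(count_colors(sample))
    print(top_color_share(sample))
```

A share close to 1 would suggest a nearly unicolor image, which is one plausible use of such a figure.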
docs/document.rst
+3 -3 lines changed: 3 additions & 3 deletions
@@ -49,6 +49,7 @@ For details on **embedded files** refer to Appendix 3.
49
49
:meth:`Document.find_bookmark` retrieve page location after layouting document
50
50
:meth:`Document.fullcopy_page` PDF only: duplicate a page
51
51
:meth:`Document.get_layer` PDF only: lists of OCGs in ON, OFF, RBGroups
52
+
:meth:`Document.get_layers` PDF only: list of optional content configurations
52
53
:meth:`Document.get_oc` PDF only: get OCG /OCMD xref of image / form xobject
53
54
:meth:`Document.get_ocgs` PDF only: info on all optional content groups
54
55
:meth:`Document.get_ocmd` PDF only: retrieve definition of an :data:`OCMD`
@@ -76,7 +77,6 @@ For details on **embedded files** refer to Appendix 3.
76
77
:meth:`Document.journal_redo` PDF only: redo current operation
77
78
:meth:`Document.journal_save` PDF only: save journal to a file
78
79
:meth:`Document.journal_load` PDF only: load journal from a file
79
-
:meth:`Document.layer_configs` PDF only: list of optional content configurations
80
80
:meth:`Document.layer_ui_configs` PDF only: list of optional content intents
81
81
:meth:`Document.layout` re-paginate the document (if supported)
82
82
:meth:`Document.load_page` read a page
@@ -226,13 +226,13 @@ For details on **embedded files** refer to Appendix 3.
226
226
:arg int ocxref: the :data:`xref` number of an :data:`OCG` / :data:`OCMD`. If not zero, an invalid reference raises an exception. If zero, any OC reference is removed.
227
227
228
228
229
-
.. method:: layer_configs()
229
+
.. method:: get_layers()
230
230
231
231
*(New in v1.18.3)*
232
232
233
233
Show optional layer configurations. There always is a standard one, which is not included in the response.
234
234
235
-
>>> for item in doc.layer_configs: print(item)
235
+
>>> for item in doc.get_layers(): print(item)
236
236
{'number': 0, 'name': 'my-config', 'creator': ''}
237
237
>>> # use 'number' as config identifier in add_ocg
docs/faq.rst
+4 -89 lines changed: 4 additions & 89 deletions
@@ -706,97 +706,12 @@ The text sequence extracted from a page modified in this way will look like this
706
706
2. header line
707
707
3. footer line
708
708
709
-
PyMuPDF has several means to re-establish some reading sequence or even to re-generate a layout close to the original.
709
+
PyMuPDF has several means to re-establish some reading sequence or even to re-generate a layout close to the original:
710
710
711
-
As a starting point take the above mentioned `script <https://github.com/pymupdf/PyMuPDF/wiki/How-to-extract-text-from-a-rectangle>`_ and then use the full page rectangle.
712
-
713
-
On rare occasions, when the PDF creator has been "over-creative", extracted text does not even keep the correct reading sequence of **single letters**: instead of the two words "DELUXE PROPERTY" you might sometimes get an anagram, consisting of 8 words like "DEL", "XE" , "P", "OP", "RTY", "U", "R" and "E".
714
-
715
-
Such a PDF is also not searchable by all PDF viewers, but it is displayed correctly and looks harmless.
716
-
717
-
In those cases, the following function will help composing the original words of the page. The resulting list is also searchable and can be used to deliver rectangles for the found text locations::
718
-
719
-
from operator import itemgetter
720
-
from itertools import groupby
721
-
import fitz
722
-
723
-
def recover(words, rect):
724
-
""" Word recovery.
725
-
726
-
Notes:
727
-
Method 'get_textWords()' does not try to recover words, if their single
728
-
letters do not appear in correct lexical order. This function steps in
729
-
here and creates a new list of recovered words.
730
-
Args:
731
-
words: list of words as created by 'get_textWords()'
732
-
rect: rectangle to consider (usually the full page)
733
-
Returns:
734
-
List of recovered words. Same format as 'get_text_words', but left out
735
-
block, line and word number - a list of items of the following format:
736
-
[x0, y0, x1, y1, "word"]
737
-
"""
738
-
# build my sublist of words contained in given rectangle
739
-
mywords = [w for w in words if fitz.Rect(w[:4]) in rect]
740
-
741
-
# sort the words by lower line, then by word start coordinate
742
-
mywords.sort(key=itemgetter(3, 0)) # sort by y1, x0 of word rectangle
# for each line coordinate ("_"), the list of words is given
751
-
for _, words_in_line in grouped_lines:
752
-
for i, w in enumerate(words_in_line):
753
-
if i == 0: # store first word
754
-
x0, y0, x1, y1, word = w[:5]
755
-
continue
756
-
757
-
r = fitz.Rect(w[:4]) # word rect
758
-
759
-
# Compute word distance threshold as 20% of width of 1 letter.
760
-
# So we should be safe joining text pieces into one word if they
761
-
# have a distance shorter than that.
762
-
threshold = r.width / len(w[4]) / 5
763
-
if r.x0 <= x1 + threshold: # join with previous word
764
-
word += w[4] # add string
765
-
x1 = r.x1 # new end-of-word coordinate
766
-
y0 = max(y0, r.y0) # extend word rect upper bound
767
-
continue
768
-
769
-
# now have a new word, output previous one
770
-
words_out.append([x0, y0, x1, y1, word])
771
-
772
-
# store the new word
773
-
x0, y0, x1, y1, word = w[:5]
774
-
775
-
# output word waiting for completion
776
-
words_out.append([x0, y0, x1, y1, word])
777
-
778
-
return words_out
779
-
780
-
def search_for(text, words):
781
-
""" Search for text in items of list of words
782
-
783
-
Notes:
784
-
Can be adjusted / extended in obvious ways, e.g. using regular
785
-
expressions, or being case insensitive, or only looking for complete
786
-
words, etc.
787
-
Args:
788
-
text: string to be searched for
789
-
words: list of items in format delivered by 'get_text_words()'.
790
-
Returns:
791
-
List of rectangles, one for each found locations.
792
-
"""
793
-
rect_list = []
794
-
for w in words:
795
-
if text in w[4]:
796
-
rect_list.append(fitz.Rect(w[:4]))
797
-
798
-
return rect_list
711
+
1. Use the ``sort`` parameter of :meth:`Page.get_text`. It sorts the output from top-left to bottom-right (ignored for XHTML, HTML and XML output).
712
+
2. Use the ``fitz`` module from the command line: ``python -m fitz gettext ...``. This produces a text file in which the text has been re-arranged in layout-preserving mode. Many options are available to control the output.
799
713
714
+
You can also use the above-mentioned `script <https://github.com/pymupdf/PyMuPDF/wiki/How-to-extract-text-from-a-rectangle>`_ with your own modifications.
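To make "top-left to bottom-right" concrete, here is a minimal sketch of such an ordering on plain tuples. It assumes items shaped like `(x0, y0, x1, y1, text)` and reuses the key from the removed `recover()` function above (bottom coordinate `y1` first, then left coordinate `x0`). The name `sort_reading_order` and the sample data are hypothetical, and the key that :meth:`Page.get_text` uses internally is not stated in this diff.

```python
from operator import itemgetter


def sort_reading_order(items):
    """Return items ordered top-left to bottom-right.

    Each item is (x0, y0, x1, y1, text); y grows downwards, as on a PDF page
    in PyMuPDF coordinates. Sorting uses y1 first, then x0.
    """
    return sorted(items, key=itemgetter(3, 0))


if __name__ == "__main__":
    # Hypothetical extraction order: footer first, then two header fragments.
    extracted = [
        (50, 700, 100, 712, "footer"),
        (120, 50, 170, 62, "line"),
        (50, 50, 110, 62, "header"),
    ]
    print(" ".join(item[4] for item in sort_reading_order(extracted)))
```

Sorting by exact coordinates works for cleanly aligned text; fragments whose baselines differ slightly would need some tolerance when grouping into lines.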