tm or cm matrix with a visitor for 'extract_text'? #3377

@Apfelkuchenbemme

Description

I was testing this example from the documentation, but print(text_body) printed an empty string. I then changed the visitor function to this:

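def visitor_body(text, cm, tm, font_dict, font_size):
    # y = cm[5]  # original line from the documentation example
    y = tm[5]  # take the vertical offset from the text matrix instead
    print(cm, tm)  # debug output to compare both matrices
    if 50 < y < 720:
        parts.append(text)  # parts is the list defined in the documentation example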

and ran it again. This produced the expected outcome: print(text_body) printed the text of the 4th page of the PDF without header and footer. All printed cm matrices looked like this: [1.0, 0.0, 0.0, 1.0, 0.0, 0.0], i.e. the identity matrix, so cm[5] was always 0.0 and the condition 50 < y < 720 never matched. I didn't look under the hood of pypdf itself to check where this might be used "incorrectly", but I think it's the same problem with the second example on that page.
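To make the behaviour reproducible without the PDF, here is a minimal sketch of the modified visitor driven by hand-written matrices. The `make_body_visitor` helper and the tm values are made up for illustration; only the visitor signature and the identity cm come from the observations above.

```python
def make_body_visitor(parts, y_min=50, y_max=720):
    """Build a visitor that keeps text whose tm y-offset lies between the bounds."""
    def visitor_body(text, cm, tm, font_dict, font_size):
        y = tm[5]  # vertical offset from the text matrix, not from cm
        if y_min < y < y_max:
            parts.append(text)
    return visitor_body


# The cm value that was printed for every call in my test.
identity = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]

parts = []
visitor = make_body_visitor(parts)

# Hypothetical tm values: one above, one inside, one below the kept band.
visitor("header ", identity, [1.0, 0.0, 0.0, 1.0, 72.0, 760.0], None, 12.0)
visitor("body ", identity, [1.0, 0.0, 0.0, 1.0, 72.0, 400.0], None, 12.0)
visitor("footer", identity, [1.0, 0.0, 0.0, 1.0, 72.0, 30.0], None, 12.0)

text_body = "".join(parts)
print(text_body)
```

With pypdf, the same visitor would be passed as page.extract_text(visitor_text=visitor). Switching `y = tm[5]` back to `y = cm[5]` in this sketch leaves parts empty, which matches the empty string I got from the documentation example.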

See the SVG files created from the code of the 2nd example, which I have attached here. The first one uses the cm matrix in visitor_svg_text and puts all of the text in the top left corner. The second one uses the tm matrix, and while it won't win beauty contests, it looks a lot better:

[Image: SVG output using the cm matrix]
[Image: SVG output using the tm matrix]
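If cm is not always the identity in other PDFs, using tm alone might also be off. As a sketch only, assuming both matrices arrive as flat six-element lists [a, b, c, d, e, f] in the usual PDF convention (x' = a*x + c*y + e, y' = b*x + d*y + f), the text origin in page space could be computed by applying tm first and then cm. The `text_origin` helper is hypothetical, and I have not checked whether pypdf intends visitors to combine the matrices this way.

```python
def text_origin(cm, tm):
    """Map the text-space origin through tm and then cm (hypothetical helper)."""
    # The origin (0, 0) in text space lands at the translation part of tm.
    x, y = tm[4], tm[5]
    # Apply cm = [a, b, c, d, e, f] to that point.
    a, b, c, d, e, f = cm
    return (a * x + c * y + e, b * x + d * y + f)


identity = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]

# With an identity cm, the result is simply (tm[4], tm[5]).
print(text_origin(identity, [1.0, 0.0, 0.0, 1.0, 72.0, 400.0]))

# With a made-up scaling and translating cm, the position changes.
print(text_origin([2.0, 0.0, 0.0, 2.0, 10.0, 20.0], [1.0, 0.0, 0.0, 1.0, 5.0, 6.0]))
```

For the PDF I tested, this gives the same coordinates as reading tm[4] and tm[5] directly, since every cm was the identity.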

Environment

Python 3.10.0, pypdf 5.7.0, Windows 10

Code + PDF

See the links to the examples in the documentation above.

Traceback

N/A
