AcroPDF style forms unable to read. #510

KrishnaGadia · 2024-11-19T12:27:36Z

When using the lattice mode on a PDF generated using AcroPDF, I am unable to see the text. Only the table structure is visible.

Steps to reproduce the bug

pip install camelot-py

camelot -o cam -f csv -p3 lattice 26_Form_MGT-7-XXXX.pdf

Expected behavior

The extracted tables should contain the text inside their cells, not blanks.

Code

import camelot

pdf_path = "26_Form_MGT-7-XXX.pdf"
pages = "3"  # page 3 only, same as -p3 in the CLI call

# Extract tables
tables = camelot.read_pdf(
    pdf_path,
    pages=pages,
    flavor="lattice",  # 'lattice' for tables with ruling lines, 'stream' for whitespace-separated tables
)

for table in tables:
    for row in table.cells:
        for cell in row:
            print(cell.text)  # prints an empty string for every cell

# the number of rows and columns is correct

PDF

In the attachment
26_Form_MGT-7-21122016_signed.pdf

Screenshots

NA

Environment

OS: macOS
Python version: 3.9
Numpy version:
OpenCV version:
Ghostscript version:
Camelot version:

Additional context

The text is visible using the PyMuPDF library (imported as fitz), under the page's widgets and words.
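As a rough illustration of that workaround, here is a minimal sketch that reads form-field values with PyMuPDF and assigns each one to the cell box containing its center. The function names `read_widgets` and `fill_cells` are hypothetical, not part of Camelot or PyMuPDF. The sketch assumes an unrotated page whose origin is at (0, 0), that widget rectangles use a top-left origin while the cell boxes use a bottom-left origin, and that the field values are the text that should appear in the table.

```python
def read_widgets(pdf_path, page_number):
    """Return (page_height, [(x0, y0, x1, y1, value), ...]) for one page.

    Needs PyMuPDF; it is imported here so fill_cells() works without it.
    """
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        page = doc[page_number - 1]  # PyMuPDF pages are 0-based
        widgets = [
            (w.rect.x0, w.rect.y0, w.rect.x1, w.rect.y1, w.field_value)
            for w in page.widgets()
        ]
        return page.rect.height, widgets


def fill_cells(cells, widgets, page_height):
    """Map widget values onto cell boxes.

    cells:   list of (left, bottom, right, top), origin at bottom-left.
    widgets: list of (x0, y0, x1, y1, value), origin at top-left.
    Returns one string per cell, in the same order as `cells`.
    """
    texts = ["" for _ in cells]
    for x0, y0, x1, y1, value in widgets:
        if value in (None, ""):
            continue  # skip empty form fields
        cx = (x0 + x1) / 2
        cy = page_height - (y0 + y1) / 2  # flip y to bottom-left origin
        for i, (left, bottom, right, top) in enumerate(cells):
            if left <= cx <= right and bottom <= cy <= top:
                texts[i] = (texts[i] + " " + str(value)).strip()
                break  # a widget belongs to at most one cell
    return texts
```

With Camelot, the cell boxes could presumably be built from each cell's `x1`, `y1`, `x2`, `y2` attributes while iterating `table.cells`; whether the two coordinate systems line up exactly for a given PDF is an assumption that would need checking.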

I did honestly try to figure it out myself, but got lost in the text extraction part.
I was able to view the form text via the fitz package. It was not directly visible, though; I had to look into the spans to get that information.
I really like how you present the data as a pandas DataFrame and support export to CSV, JSON and other formats.
I am usually not active on GitHub. Kindly reach out to [email protected]


KrishnaGadia added the bug Something isn't working label Nov 19, 2024
