Fail to open large/complex PDF with NULL from pdfioFileOpen #92

Closed
sherrellbc opened this issue Jan 23, 2025 · 17 comments

@sherrellbc

sherrellbc commented Jan 23, 2025

Describe the bug
Fails to open specific PDF

To Reproduce
Steps to reproduce the behavior:

  1. Download ARM reference manual: https://developer.arm.com/documentation/ddi0487/latest/
  2. Run pdfioinfo.c example with PDF from (1)
  3. Call to pdfioFileOpen fails with NULL

Expected behavior
The PDF opens and a valid file reference is returned.

System Information:

  • OS: Linux, Debian bookworm

Additional context
PDFIO version v1.4.0, but latest on master also fails.

I enabled debug but the output is quite dense. The final reported debug was:

...
pdfioFileFindObj(pdf=0x312d9720, number=4946) alloc_objs=42336, num_objs=42306, objs=0x7feec5ccc010
pdfioFileFindObj: objs[current=4945]=0x31344300(24339)
pdfioFileFindObj: Returning NULL
add_obj: obj=0x317a5150, ->pdf=0x312d9720, ->number=4946, ->offset=0
add_obj: Inserting at 2076
_pdfioTokenRead: state='N'
_pdfioTokenRead: Read '318'.
load_obj_stream: 4946 at offset 318
_pdfioTokenRead: state='N'
_pdfioTokenRead: Read '4704'.
pdfioFileFindObj(pdf=0x312d9720, number=4704) alloc_objs=42336, num_objs=42307, objs=0x7feec5ccc010
pdfioFileFindObj: objs[current=4703]=0x3133d160(23566)
pdfioFileFindObj: Returning NULL
add_obj: obj=0x317a51c0, ->pdf=0x312d9720, ->number=4704, ->offset=0
add_obj: Inserting at 1995
_pdfioTokenRead: state='N'
_pdfioTokenRead: Read '317'.
load_obj_stream: 4704 at offset 317
get_char: Consuming 130 bytes.
stream_read: No predictor.
@michaelrsweet
Owner

OK, I added some missing error reporting - turns out there is a problem with a compressed object stream.

Continuing to investigate...

@michaelrsweet michaelrsweet self-assigned this Jan 23, 2025
@michaelrsweet michaelrsweet added bug Something isn't working priority-low labels Jan 23, 2025
@michaelrsweet michaelrsweet added this to the Stable milestone Jan 23, 2025
@michaelrsweet
Owner

OK, so the compressed object stream code ignored the count in the object dictionary. Relatively simple change to fix that:

[master 9e2f3ab] Fix reading of compressed object streams (Issue #92)
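
For context on the fix above: an object stream's dictionary carries an /N entry (the number of compressed objects) and a /First entry (the byte offset of the first object in the decoded data), and the decoded data begins with N pairs of integers giving each object's number and its offset relative to /First. The following is a minimal sketch of reading exactly that many pairs. It is not PDFio's actual code; `objstm_entry_t` and `parse_objstm_header` are hypothetical names used only for illustration.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical record for one entry in an object stream header. */
typedef struct {
    long number;    /* object number */
    long offset;    /* byte offset relative to the /First value */
} objstm_entry_t;

/*
 * Parse exactly `count` "number offset" pairs from the start of the decoded
 * object stream data. `count` is assumed to come from the /N entry of the
 * stream dictionary. Returns the number of pairs actually parsed, which is
 * less than `count` if the header is malformed or too short.
 */
size_t parse_objstm_header(const char *header, size_t count,
                           objstm_entry_t *entries)
{
    size_t i;
    char *end;

    for (i = 0; i < count; i++) {
        long number, offset;

        number = strtol(header, &end, 10);  /* strtol skips leading whitespace */
        if (end == header || number < 0)
            break;
        header = end;

        offset = strtol(header, &end, 10);
        if (end == header || offset < 0)
            break;
        header = end;

        entries[i].number = number;
        entries[i].offset = offset;
    }

    return i;
}
```

Stopping at the declared count matters because the first object's data follows the header immediately and may itself start with digits, so reading "until the numbers run out" can pick up bogus entries.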

@sherrellbc
Author

Fantastic response time. Thanks for looking into this AND fixing it. It looks like the pdfioinfo example now runs, but it reports bad data for Title and Author:

         Title: ��
        Author: ��
    Created On: Wed Dec  4 20:39:34 2024
  Number Pages: 14568

Can you confirm whether this information is simply undefined in this PDF?

@sherrellbc
Author

sherrellbc commented Jan 24, 2025

Hm, pdf2text.c likewise has trouble with this PDF. However, I also tried it with testpdfio.pdf and it did not work there either. The output appears to consist of fragments of the control sequences from within the PDF. Perhaps I am misunderstanding the purpose of this example code: is it expected to produce the text content of the PDF?

I did run the unit tests and they all claim to pass.

@michaelrsweet
Owner

This PDF file (incorrectly) embeds Unicode text for some (but not all) of the metadata such as the title and author. I can look at detecting this and converting the strings to UTF-8 (they start with a Byte Order Mark or BOM so it is straightforward to detect).
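
As an illustration of the BOM check described above, here is a minimal sketch of converting a BOM-prefixed UTF-16 string to UTF-8. It is not the code PDFio ended up with; `utf16_to_utf8` and `get16` are hypothetical helpers. PDF text strings use the big-endian BOM (FE FF); accepting the little-endian BOM (FF FE) as well is an extra leniency assumed here.

```c
#include <stddef.h>

/* Read one 16-bit code unit in the given byte order. */
static unsigned long get16(const unsigned char *p, int big_endian)
{
    return big_endian ? (((unsigned long)p[0] << 8) | p[1])
                      : (((unsigned long)p[1] << 8) | p[0]);
}

/*
 * Convert a UTF-16 string that starts with a BOM to NUL-terminated UTF-8.
 * Returns the number of bytes written (excluding the NUL), or 0 if the
 * input has no BOM. Output that does not fit in `out` is truncated.
 */
size_t utf16_to_utf8(const unsigned char *in, size_t inlen,
                     char *out, size_t outsize)
{
    int big_endian;
    size_t i, o = 0;

    if (inlen < 2 || outsize == 0)
        return 0;

    if (in[0] == 0xFE && in[1] == 0xFF)
        big_endian = 1;
    else if (in[0] == 0xFF && in[1] == 0xFE)
        big_endian = 0;
    else
        return 0;                       /* no BOM: not UTF-16 */

    for (i = 2; i + 1 < inlen; i += 2) {
        unsigned long ch = get16(in + i, big_endian);

        /* Combine a high/low surrogate pair into one code point. */
        if (ch >= 0xD800 && ch <= 0xDBFF && i + 3 < inlen) {
            unsigned long lo = get16(in + i + 2, big_endian);

            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                ch = 0x10000 + ((ch - 0xD800) << 10) + (lo - 0xDC00);
                i += 2;
            }
        }

        if (o + 4 >= outsize)           /* keep room for 4 bytes plus NUL */
            break;

        if (ch < 0x80) {
            out[o++] = (char)ch;
        } else if (ch < 0x800) {
            out[o++] = (char)(0xC0 | (ch >> 6));
            out[o++] = (char)(0x80 | (ch & 0x3F));
        } else if (ch < 0x10000) {
            out[o++] = (char)(0xE0 | (ch >> 12));
            out[o++] = (char)(0x80 | ((ch >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (ch & 0x3F));
        } else {
            out[o++] = (char)(0xF0 | (ch >> 18));
            out[o++] = (char)(0x80 | ((ch >> 12) & 0x3F));
            out[o++] = (char)(0x80 | ((ch >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (ch & 0x3F));
        }
    }

    out[o] = '\0';
    return o;
}
```

Getting the byte order wrong does not fail loudly: each 16-bit unit is simply read with its bytes swapped, which turns Latin text into unrelated CJK-range characters.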

pdf2text.c is a very very very basic example of extracting plain text from the PDF page streams. It almost certainly won't work for complex PDF files (particularly if they use Unicode fonts) but has what you'd need to extract text from simple PDF reports or shipping labels.

@michaelrsweet michaelrsweet reopened this Jan 24, 2025
@sherrellbc
Author

Fair enough. Thank you for your effort here. I suppose I better become acquainted with the PDF spec.

@michaelrsweet
Owner

The Xpdf and Poppler projects both include utilities for extracting Unicode text from pages by interpreting the page content vs. simply "scraping" the pages like pdf2text.c does.

michaelrsweet added a commit that referenced this issue Jan 24, 2025
Specifically to support Unicode Title and Author values.
@michaelrsweet
Owner

[master cca7383] Fix support for UTF-16 string values in dictionaries (Issue #92)

@michaelrsweet
Owner

This is what I get now:

DDI0487L_a_a-profile_architecture_reference_manual.pdf:
         Title: 䄀爀洀글 䄀爀挀栀椀琀攀挀琀甀爀攀 刀攀昀攀爀攀渀挀攀 䴀愀渀甀愀氀Ⰰ 昀漀爀 䄀ⴀ瀀爀漀昀椀氀攀 愀爀挀栀椀琀攀挀琀甀爀攀
        Author: 愀爀洀
    Created On: Wed Dec  4 20:39:34 2024
  Number Pages: 14568

@sherrellbc
Author

sherrellbc commented Jan 24, 2025

The Xpdf and Poppler projects both include utilities for extracting Unicode text from pages by interpreting the page content vs. simply "scraping" the pages like pdf2text.c does.

Thanks for the suggestions. After looking through your API and briefly educating myself on the PDF spec, it seems like in order to do text extraction one needs to:

  1. open per-page streams (pdfioPageOpenStream)
  2. loop through all tokens (pdfioStreamGetToken)
  3. operate on BT tokens, possibly applying the appropriate filter to extract raw text content
  4. possibly follow obj references to other pages and apply steps 1, 2, and 3 recursively

I don't see an indication that any transformations are applied to the payload content of streams, so I'd need to implement that too. Does this sound about right or am I missing any steps? I imagine this is roughly what Xpdf or Poppler do before providing the raw text content.
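
Steps 2 and 3 above can be sketched without the library by working on an array of tokens that have already been read from a page stream. This is only a rough illustration: `collect_text` is a hypothetical helper, and it assumes that literal string tokens arrive with a leading '(' and no closing parenthesis, which is how pdf2text.c appears to treat them. The bytes it collects are only readable for simple fonts; anything using custom encodings or Unicode fonts needs the font's mapping applied as well.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Concatenate the literal-string tokens found between BT and ET into `out`.
 * Assumption: a string token starts with '(' followed by its raw bytes.
 * Returns the length of the collected text; output that does not fit is
 * truncated and always NUL-terminated.
 */
size_t collect_text(const char *tokens[], size_t num_tokens,
                    char *out, size_t outsize)
{
    bool in_text = false;               /* true between BT and ET */
    size_t i, len = 0;

    if (outsize == 0)
        return 0;
    out[0] = '\0';

    for (i = 0; i < num_tokens; i++) {
        const char *t = tokens[i];

        if (strcmp(t, "BT") == 0) {
            in_text = true;
        } else if (strcmp(t, "ET") == 0) {
            in_text = false;
        } else if (in_text && t[0] == '(') {
            size_t n = strlen(t + 1);

            if (len + n >= outsize)     /* truncate to the space that is left */
                n = outsize - len - 1;
            memcpy(out + len, t + 1, n);
            len += n;
            out[len] = '\0';
        }
    }

    return len;
}
```

In a real program the token array would be replaced by a loop over pdfioStreamGetToken for each stream opened with pdfioPageOpenStream, as in steps 1 and 2.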

@michaelrsweet
Owner

Yes, you have it about right. Xpdf/Poppler also use the transforms/positioning to decide which text goes together, adding whitespace, etc.

@sherrellbc
Author

sherrellbc commented Jan 25, 2025

I overlooked the fact that pdfioPageOpenStream's third argument controls whether PDFIO itself does the decompression. However, I fear something is not quite working. Following the steps above (and borrowing heavily from your pdf2text.c example), if I open the file, open each stream with decoding enabled, and collect all the tokens until the end of the stream, the decoded data looks unexpected.

q000000c/g0001kq100100cB/1T100177T[(-(-(1(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-<-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(]T1TT[(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-<-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(]TT[(-(-(-(-(3(-(-(-<-(-(-(-(-(-(-(-(-(-(-(1(-(-(-(-(-(-(-(-(-(-(-(-<-(-(-(-(-(-(-(]TT[(3(1(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(1(-(-(-(-(-(-(-(-(-(]TT[(-(-(-(1(-(-(1(-(-(-(-(-(-<-(-(-(-(-(-(-(-(-(-(-(-(-(-(3(-(1(-(-(-(-(-(-(-(-(-(-(-(]TT[(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(1(-(-(1(-(-(-(]TT[(-(-(-(-(-( ...

And if you pass false so that PDFIO does not attempt to decompress the stream, it looks like there is an internal error of some sort while parsing each stream.

785dffffffcdffffffe15d76fffffffa29ffffff997bfffffffapdfio/testfiles/testpdfio.pdf: Syntax error: '>6'.

785d4b230104pdfio/testfiles/testpdfio.pdf: Unterminated string literal.

782b101a2bffffff99ffffffd1ffffff8a3c39ffffffba2f5bffffffe82f00ffffffb22fffffffe458pdfio/testfiles/testpdfio.pdf: Syntax error: '<'

785d4b59187bffffff8b2b2266pdfio/testfiles/testpdfio.pdf: Syntax error: '<o'

I may well just be demonstrating my ignorance here, but I'm failing to find anything relevant in the PDF specification that would explain this decoded stream.

Here is the test.c reproducer, which I ran against testfiles/testpdfio.pdf. Hopefully I am just doing something obviously wrong. GitHub rejects .c files, but luckily it is short enough to paste here:

#include <pdfio.h>
#include <stdio.h>

bool g_decode = false;

static void pdf_parse(pdfio_file_t *pdf)
{
    pdfio_obj_t *page;
    pdfio_stream_t *stream;
    unsigned i, j, npages, nstreams;
    char buf[1ull << 20];

    npages = pdfioFileGetNumPages(pdf);
    for(i=0; i<npages; i++){
        page = pdfioFileGetPage(pdf, i);

        nstreams = pdfioPageGetNumStreams(page);
        for(j=0; j<nstreams; j++){
            if( !(stream = pdfioPageOpenStream(page, j, g_decode)) )
                continue;

            while(pdfioStreamGetToken(stream, buf, sizeof(buf)))
                fprintf(stderr, g_decode ? "%c" : "%02x", buf[0]);

            pdfioStreamClose(stream);
            fprintf(stderr, "\n");
        }
    }
}

int main(int argc, char **argv)
{
    pdfio_file_t *pdf;

    if(argc < 2){
        fprintf(stderr, "Usage: %s <file.pdf>\n", argv[0]);
        return -1;
    }

    pdf = pdfioFileOpen(argv[1], NULL, NULL, NULL, NULL);
    if(!pdf){
        fprintf(stderr, "Failed to open \'%s\'\n", argv[1]);
        return -1;
    }

    pdf_parse(pdf);
    pdfioFileClose(pdf);
    return 0;
}

@sherrellbc
Author

Ah, I worked it out. At first I thought a token meant a whole line from the PDF stream. Then I thought it might be a single character that clients of PDFIO would build a state machine around, incrementally collecting the components of each command with each successive call to pdfioStreamGetToken.

Thanks to your debug output:

 <parse>        Pg.0 stream.0.open()
stream_read: No predictor.
get_char: Read 'q 0.1 0 0 0.1 0 0 cm\012/R8 gs\0120 0 0 1 k\012q\01210 0 0 10 0 0 cm BT\012/R26 12 Tf\0121 0 0 1 72 746.41 Tm\012[(\016)-35.1522(#)-37.1526(&)13.9923(\032)-41.1525(!)-341.84(\036)-68.1681($)-40.152(')-39(\))-72.1527(!)-341.84(\031)-72.1527(#)-37.1521( )-70
.1681(#)-37.1521(&)-263.84(')-39('
_pdfioTokenRead: state='K'
_pdfioTokenRead: Read 'q'.
_pdfioTokenFlush: Consuming 2 bytes.
_pdfioTokenFlush: Remainder '0.1 0 0 0.1 0 0 cm\012/R8 gs\0120 0 0 1 k\012q\01210 0 0 10 0 0 cm BT\012/R26 12 Tf\0121 0 0 1 72 746.41 Tm\012[(\016)-35.1522(#)-37.1526(&)13.9923(\032)-41.1525(!)-341.84(\036)-68.1681($)-40.152(')-39(\))-72.1527(!)-341.84(\031)-72.1527(#)-37
.1521( )-70.1681(#)-37.1521(&)-263.84(')-39('
 <parse>    q
get_char: Read '0.1 0 0 0.1 0 0 cm\012/R8 gs\0120 0 0 1 k\012q\01210 0 0 10 0 0 cm BT\012/R26 12 Tf\0121 0 0 1 72 746.41 Tm\012[(\016)-35.1522(#)-37.1526(&)13.9923(\032)-41.1525(!)-341.84(\036)-68.1681($)-40.152(')-39(\))-72.1527(!)-341.84(\031)-72.1527(#)-37.1521( )-70.1
681(#)-37.1521(&)-263.84(')-39(\036)'
_pdfioTokenRead: state='N'
_pdfioTokenRead: Read '0.1'.
_pdfioTokenFlush: Consuming 4 bytes.
_pdfioTokenFlush: Remainder '0 0 0.1 0 0 cm\012/R8 gs\0120 0 0 1 k\012q\01210 0 0 10 0 0 cm BT\012/R26 12 Tf\0121 0 0 1 72 746.41 Tm\012[(\016)-35.1522(#)-37.1526(&)13.9923(\032)-41.1525(!)-341.84(\036)-68.1681($)-40.152(')-39(\))-72.1527(!)-341.84(\031)-72.1527(#)-37.152
1( )-70.1681(#)-37.1521(&)-263.84(')-39(\036)'
 <parse>    0.1

I noticed that I was just printing the first character of the string token that was extracted. So a token is a component of a stream separated by whitespace?

Maybe I missed it, but it might be worth documenting what a token looks like in the context of a real PDF example (like the Hello World PDF in the docs). I did read the brief description of pdfioStreamGetToken in the documentation, but it was still unclear to me that tokens would look like this.
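
To make the idea of a token concrete, here is a deliberately simplified tokenizer applied to the start of the stream from the debug output above. It only splits on whitespace and skips '%' comments; it is not how PDFio tokenizes, since a real tokenizer also has to treat strings, arrays, and dictionaries as delimited units. `next_token` and `is_pdf_space` are hypothetical names.

```c
#include <stddef.h>

/* Whitespace characters that separate tokens (NUL is left out here because
 * the input is a C string). */
static int is_pdf_space(int ch)
{
    return ch == ' ' || ch == '\t' || ch == '\n' || ch == '\f' || ch == '\r';
}

/*
 * Copy the next whitespace-delimited token from *sp into buf and advance
 * *sp past it. Comments ('%' to end of line) are skipped. Tokens longer
 * than the buffer are truncated. Returns 1 if a token was read, 0 at the
 * end of the input.
 */
int next_token(const char **sp, char *buf, size_t bufsize)
{
    const char *s = *sp;
    size_t len = 0;

    for (;;) {
        while (is_pdf_space((unsigned char)*s))
            s++;
        if (*s != '%')
            break;
        while (*s && *s != '\n' && *s != '\r')
            s++;
    }

    while (*s && !is_pdf_space((unsigned char)*s)) {
        if (len + 1 < bufsize)
            buf[len++] = *s;
        s++;
    }

    if (bufsize > 0)
        buf[len] = '\0';

    *sp = s;
    return len > 0;
}
```

Called repeatedly on "q 0.1 0 0 0.1 0 0 cm\n/R8 gs\n", this yields q, 0.1, 0, 0, 0.1, 0, 0, cm, /R8, and gs in turn, so each token is a whole string and should be printed with "%s" rather than as a single character.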

@sherrellbc
Author

Sorry for polluting this ticket, but this repo does not have a 'Discussions' section like some others on GitHub. Looking into parsing a bit further, I noticed that PDFIO strips the line feed character from the end of a stream command.

For instance, you can actually see it in the debug output I pasted above:

_pdfioTokenFlush: Remainder '0.1 0 0 0.1 0 0 cm\012/R8 gs\0120 0 0 1 k\012q\01210 0 0 10 0 0 cm BT\012/R26 12 Tf\0121 0 0 1 72 746.41 Tm\012[(\016)-35.1522(#)-37.1526(&)13.9923(\032)-41.1525(!)-341.84(\036)-68.1681($)-40.152(')-39(\))-72.1527(!)-341.84(\031)-72.1527(#)-37
.1521( )-70.1681(#)-37.1521(&)-263.84(')-39('

In particular: 0.1 0 0 0.1 0 0 cm\012. However, when pdfioStreamGetToken returns this data, you provide only cm; the trailing LF has been stripped from cm\LF.

You can see the collection of this first sequence of commands.

 <parse>                q
 <parse>  71 00 00 00

 <parse>                0.1
 <parse>  30 2e 31 00

 <parse>                0
 <parse>  30 00 00 00

 <parse>                0
 <parse>  30 00 00 00

 <parse>                0.1
 <parse>  30 2e 31 00

 <parse>                0
 <parse>  30 00 00 00

 <parse>                0
 <parse>  30 00 00 00

 <parse>                cm
 <parse>  63 6d 00 00

Perhaps this is inconsequential, or perhaps it is more of a postfix parsing question. But without some indication that the most recent token is an operator, you are left only with the option of comparing each token you receive against all known operators, no?

This seems a little inefficient. Am I missing something?

@michaelrsweet
Owner

@sherrellbc I'm pretty sure I enabled the GitHub Discussions feature for the PDFio project. However, since I'm always logged in and "own" the repository, it could be that I haven't done something needed to allow you to ask questions there.

WRT the pdfioStreamGetToken API, it is designed to return the next PDF processing token on the stream, which automatically skips whitespace and comments. I should probably add some more examples for this function along with documenting that name values start with '/'. But basically if the first character in the returned string is a letter then you have probably found a PDF operator ("false", "null", and "true" are the only exceptions since they are special values).
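
The first-character rule described above can be sketched as a small classifier. This is an illustration only, not part of the PDFio API: `token_type_t` and `classify_token` are hypothetical names, and the handling of '(' and '<' assumes string tokens keep their opening delimiter. The text-showing operators ' and " are handled separately because they are operators that do not start with a letter.

```c
#include <ctype.h>
#include <string.h>

typedef enum {
    TOKEN_OPERATOR,     /* e.g. cm, Tf, BT */
    TOKEN_NAME,         /* starts with '/' */
    TOKEN_NUMBER,       /* e.g. 0.1, -35.1522 */
    TOKEN_STRING,       /* literal or hex string */
    TOKEN_VALUE,        /* false, null, true */
    TOKEN_OTHER         /* array and dictionary delimiters, etc. */
} token_type_t;

/* Classify a token by looking at its first character. */
token_type_t classify_token(const char *token)
{
    unsigned char ch = (unsigned char)token[0];

    if (ch == '/')
        return TOKEN_NAME;

    /* "<<" opens a dictionary, a single '<' opens a hex string. */
    if (ch == '(' || (ch == '<' && token[1] != '<'))
        return TOKEN_STRING;

    if (isdigit(ch) || ch == '+' || ch == '-' || ch == '.')
        return TOKEN_NUMBER;

    /* The ' and " text operators do not start with a letter. */
    if (strcmp(token, "'") == 0 || strcmp(token, "\"") == 0)
        return TOKEN_OPERATOR;

    if (isalpha(ch)) {
        if (strcmp(token, "false") == 0 || strcmp(token, "null") == 0 ||
            strcmp(token, "true") == 0)
            return TOKEN_VALUE;
        return TOKEN_OPERATOR;
    }

    return TOKEN_OTHER;
}
```

With this in place, a postfix interpreter can push every non-operator token onto an operand stack and only look up the operator name when `TOKEN_OPERATOR` is returned, instead of comparing every token against the full operator list.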

WRT efficiency, PDFio isn't designed for rendering PDFs (this is stated pretty clearly at the front of the documentation) but rather for manipulating them. And while a PDF renderer/RIP would likely provide you with/use a more efficient representation of the page stream, it would first need to tokenize the input... Suffice it to say that PDF is far from the most efficient format to deal with and isn't ideal for extracting arbitrary information from...

@michaelrsweet
Owner

Feel free to add any additional documentation comments to issue #95 that I created to track the pdfioStreamGetToken documentation along with other examples/docs that would be useful for extracting information from a PDF file.

@sherrellbc
Author

It looks like Discussions are available now. I did check, and yesterday it was not there. So thank you for this! If I have any further questions I will be sure to start something there.


RE my comment on efficiency - I was not referring to PDFIO specifically. It just seemed a bit strange to me that there was no clear delimitation between token types. Perhaps a decision point around the first character of the token being a letter could work well enough, with some caveats, as you mention, for other special characters like /.
