Fail to open large/complex PDF with NULL from pdfioFileOpen #92

sherrellbc · 2025-01-23T18:17:47Z

Describe the bug
Fails to open specific PDF

To Reproduce
Steps to reproduce the behavior:

1. Download ARM reference manual: https://developer.arm.com/documentation/ddi0487/latest/
2. Run pdfioinfo.c example with PDF from (1)
3. Call to pdfioFileOpen fails with NULL

Expected behavior
The PDF opens and a valid file reference is returned.

System Information:

OS: Linux, Debian bookworm

Additional context
PDFIO version v1.4.0, but latest on master also fails.

I enabled debug but the output is quite dense. The final reported debug was:

...
pdfioFileFindObj(pdf=0x312d9720, number=4946) alloc_objs=42336, num_objs=42306, objs=0x7feec5ccc010
pdfioFileFindObj: objs[current=4945]=0x31344300(24339)
pdfioFileFindObj: Returning NULL
add_obj: obj=0x317a5150, ->pdf=0x312d9720, ->number=4946, ->offset=0
add_obj: Inserting at 2076
_pdfioTokenRead: state='N'
_pdfioTokenRead: Read '318'.
load_obj_stream: 4946 at offset 318
_pdfioTokenRead: state='N'
_pdfioTokenRead: Read '4704'.
pdfioFileFindObj(pdf=0x312d9720, number=4704) alloc_objs=42336, num_objs=42307, objs=0x7feec5ccc010
pdfioFileFindObj: objs[current=4703]=0x3133d160(23566)
pdfioFileFindObj: Returning NULL
add_obj: obj=0x317a51c0, ->pdf=0x312d9720, ->number=4704, ->offset=0
add_obj: Inserting at 1995
_pdfioTokenRead: state='N'
_pdfioTokenRead: Read '317'.
load_obj_stream: 4704 at offset 317
get_char: Consuming 130 bytes.
stream_read: No predictor.


michaelrsweet · 2025-01-23T20:12:33Z

OK, I added some missing error reporting - turns out there is a problem with a compressed object stream.

Continuing to investigate...

michaelrsweet · 2025-01-23T20:28:41Z

OK, so the compressed object stream code ignored the count in the object dictionary. Relatively simple change to fix that:

[master 9e2f3ab] Fix reading of compressed object streams (Issue #92)

sherrellbc · 2025-01-23T22:27:14Z

Fantastic response time. Thanks for looking into this AND fixing it. Looks like the pdfinfo test now runs, but it reports bad data for Title and Author:

         Title: ��
        Author: ��
    Created On: Wed Dec  4 20:39:34 2024
  Number Pages: 14568

Can you confirm if this information is just undefined in this PDF?

sherrellbc · 2025-01-24T00:58:03Z

Hm, pdf2text.c likewise has trouble with this PDF. However, I tried it with testpdfio.pdf and it did not work either. The output appears to be fragments of the control sequences from within the PDF. Perhaps I am misunderstanding the purpose of this example code. Is it expected to produce the text content of the PDF?

I did run the unit tests and they all claim to pass.

michaelrsweet · 2025-01-24T01:52:40Z

This PDF file (incorrectly) embeds Unicode (UTF-16) text for some (but not all) of the metadata such as the title and author. I can look at detecting this and converting the strings to UTF-8 (they start with a Byte Order Mark or BOM so it is straightforward to detect).
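The BOM check mentioned above could look roughly like the following. This is a minimal sketch, not PDFio's actual implementation: the function name `utf16be_to_utf8` is hypothetical, and it assumes the string bytes have already been extracted from the dictionary. It relies only on the fact that a PDF text string beginning with the bytes 0xFE 0xFF is encoded as big-endian UTF-16.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper (not part of PDFio): convert a string that starts with
// the UTF-16BE byte order mark (0xFE 0xFF) to NUL-terminated UTF-8.
// Returns the number of bytes written, excluding the NUL. Returns 0 if there
// is no BOM, if the output buffer is too small, or if the string is empty.
// Unpaired surrogates are not rejected; they are encoded as-is.
static size_t utf16be_to_utf8(const unsigned char *in, size_t inlen,
                              char *out, size_t outsize)
{
  size_t i, used = 0;

  if (inlen < 2 || in[0] != 0xFE || in[1] != 0xFF || outsize == 0)
    return 0;

  for (i = 2; i + 1 < inlen; i += 2)
  {
    uint32_t ch = ((uint32_t)in[i] << 8) | in[i + 1];

    // Combine a high/low surrogate pair into a single code point
    if (ch >= 0xD800 && ch <= 0xDBFF && i + 3 < inlen)
    {
      uint32_t lo = ((uint32_t)in[i + 2] << 8) | in[i + 3];

      if (lo >= 0xDC00 && lo <= 0xDFFF)
      {
        ch = 0x10000 + ((ch - 0xD800) << 10) + (lo - 0xDC00);
        i += 2;
      }
    }

    // Conservative check: keep room for up to 4 bytes plus the NUL
    if (used + 4 >= outsize)
      return 0;

    if (ch < 0x80)
    {
      out[used++] = (char)ch;
    }
    else if (ch < 0x800)
    {
      out[used++] = (char)(0xC0 | (ch >> 6));
      out[used++] = (char)(0x80 | (ch & 0x3F));
    }
    else if (ch < 0x10000)
    {
      out[used++] = (char)(0xE0 | (ch >> 12));
      out[used++] = (char)(0x80 | ((ch >> 6) & 0x3F));
      out[used++] = (char)(0x80 | (ch & 0x3F));
    }
    else
    {
      out[used++] = (char)(0xF0 | (ch >> 18));
      out[used++] = (char)(0x80 | ((ch >> 12) & 0x3F));
      out[used++] = (char)(0x80 | ((ch >> 6) & 0x3F));
      out[used++] = (char)(0x80 | (ch & 0x3F));
    }
  }

  out[used] = '\0';
  return used;
}
```

The high byte of each 16-bit unit comes first, so reading the units in the opposite order would turn Latin text into unrelated CJK-range code points.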

pdf2text.c is a very very very basic example of extracting plain text from the PDF page streams. It almost certainly won't work for complex PDF files (particularly if they use Unicode fonts) but has what you'd need to extract text from simple PDF reports or shipping labels.

sherrellbc · 2025-01-24T14:29:42Z

Fair enough. Thank you for your effort here. I suppose I better become acquainted with the PDF spec.

michaelrsweet · 2025-01-24T15:00:48Z

The Xpdf and Poppler projects both include utilities for extracting Unicode text from pages by interpreting the page content vs. simply "scraping" the pages like pdf2text.c does.

Specifically to support Unicode Title and Author values.

michaelrsweet · 2025-01-24T15:44:39Z

[master cca7383] Fix support for UTF-16 string values in dictionaries (Issue #92)

michaelrsweet · 2025-01-24T15:45:07Z

This is what I get now:

DDI0487L_a_a-profile_architecture_reference_manual.pdf:
         Title: 䄀爀洀글 䄀爀挀栀椀琀攀挀琀甀爀攀 刀攀昀攀爀攀渀挀攀 䴀愀渀甀愀氀Ⰰ 昀漀爀 䄀ⴀ瀀爀漀昀椀氀攀 愀爀挀栀椀琀攀挀琀甀爀攀
        Author: 愀爀洀
    Created On: Wed Dec  4 20:39:34 2024
  Number Pages: 14568

sherrellbc · 2025-01-24T18:40:16Z

> The Xpdf and Poppler projects both include utilities for extracting Unicode text from pages by interpreting the page content vs. simply "scraping" the pages like pdf2text.c does.

Thanks for the suggestions. After looking through your API and briefly educating myself on the PDF spec, it seems like in order to do text extraction one needs to:

1. open per-page streams (pdfioPageOpenStream)
2. loop through all tokens (pdfioStreamGetToken)
3. operate on BT tokens, possibly applying an appropriate filter to extract raw text content
4. possibly follow obj references to other pages for, possibly, recursive application of 1, 2, and 3

I don't see an indication that any transformations are applied to the payload content of streams, so I'd need to implement that too. Does this sound about right or am I missing any steps? I imagine this is roughly what Xpdf or Poppler do before providing the raw text content.
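The token loop from the list above can be sketched without the library by working on an array of already-split tokens. This is only an illustration under stated assumptions: `collect_text` is a hypothetical helper, and it assumes that literal string tokens arrive as `(text` with the closing parenthesis already removed by the tokenizer. It ignores hex strings, font encodings, and positioning entirely.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper: append every literal-string token that appears between
// the BT (begin text) and ET (end text) operators to "out".
// ASSUMPTION: string tokens look like "(text" (leading parenthesis kept,
// trailing parenthesis already stripped by whatever produced the tokens).
// Returns the number of bytes stored, excluding the terminating NUL.
static size_t collect_text(const char *const *tokens, size_t ntokens,
                           char *out, size_t outsize)
{
  bool   in_text = false;  // true while inside a BT ... ET block
  size_t i, used = 0;

  if (outsize == 0)
    return 0;

  out[0] = '\0';

  for (i = 0; i < ntokens; i++)
  {
    const char *token = tokens[i];

    if (!strcmp(token, "BT"))
    {
      in_text = true;
    }
    else if (!strcmp(token, "ET"))
    {
      in_text = false;
    }
    else if (in_text && token[0] == '(')
    {
      size_t len = strlen(token + 1);

      // Truncate rather than overflow the output buffer
      if (len > outsize - used - 1)
        len = outsize - used - 1;

      memcpy(out + used, token + 1, len);
      used += len;
      out[used] = '\0';
    }
  }

  return used;
}
```

In a real program the token array would be replaced by repeated calls to the stream tokenizer, and the collected bytes would still need to be mapped through the font's encoding before they are readable text.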

michaelrsweet · 2025-01-24T19:07:32Z

Yes, you have it about right. Xpdf/Poppler also use the transforms/positioning to decide which text goes together, adding whitespace, etc.

sherrellbc · 2025-01-25T17:55:15Z

I overlooked the fact that pdfioPageOpenStream's third argument is in fact a control for whether or not PDFIO itself does the decompression. However, I fear something is not quite working. Following the steps above (mostly lifted from your pdf2text.c example), if I open the file, open each stream with decoding enabled, and collect all the tokens until end of stream, the decoded data looks unexpected.

q000000c/g0001kq100100cB/1T100177T[(-(-(1(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-<-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(]T1TT[(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-<-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(]TT[(-(-(-(-(3(-(-(-<-(-(-(-(-(-(-(-(-(-(-(1(-(-(-(-(-(-(-(-(-(-(-(-<-(-(-(-(-(-(-(]TT[(3(1(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(1(-(-(-(-(-(-(-(-(-(]TT[(-(-(-(1(-(-(1(-(-(-(-(-(-<-(-(-(-(-(-(-(-(-(-(-(-(-(-(3(-(1(-(-(-(-(-(-(-(-(-(-(-(]TT[(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(-(1(-(-(1(-(-(-(]TT[(-(-(-(-(-( ...

And if you pass false so that PDFIO does not attempt to decompress the stream, it looks like there is an internal error of some sort while parsing each stream.

785dffffffcdffffffe15d76fffffffa29ffffff997bfffffffapdfio/testfiles/testpdfio.pdf: Syntax error: '>6'.

785d4b230104pdfio/testfiles/testpdfio.pdf: Unterminated string literal.

782b101a2bffffff99ffffffd1ffffff8a3c39ffffffba2f5bffffffe82f00ffffffb22fffffffe458pdfio/testfiles/testpdfio.pdf: Syntax error: '<'

785d4b59187bffffff8b2b2266pdfio/testfiles/testpdfio.pdf: Syntax error: '<o'

I may well just be demonstrating my ignorance here, but I'm failing to find anything relevant in the PDF specification that would explain this decoded stream.

I have attached the test.c reproducer. This program was executed with a target of testfiles/testpdfio.pdf. Hopefully I am just doing something overtly wrong. GitHub rejects .c files, but luckily it is short enough to just paste here:

#include <pdfio.h>
#include <stdio.h>

bool g_decode = false;

static void pdf_parse(pdfio_file_t *pdf)
{
    pdfio_obj_t *page;
    pdfio_stream_t *stream;
    unsigned i, j, npages, nstreams;
    char buf[1ull << 20];

    npages = pdfioFileGetNumPages(pdf);
    for(i=0; i<npages; i++){
        page = pdfioFileGetPage(pdf, i);

        nstreams = pdfioPageGetNumStreams(page);
        for(j=0; j<nstreams; j++){
            if( !(stream = pdfioPageOpenStream(page, j, g_decode)) )
                continue;

            while(pdfioStreamGetToken(stream, buf, sizeof(buf)))
                fprintf(stderr, g_decode ? "%c" : "%02x", buf[0]);

            pdfioStreamClose(stream);
            fprintf(stderr, "\n");
        }
    }
}

int main(int argc, char **argv)
{
    pdfio_file_t *pdf;

    if(argc < 2){
        fprintf(stderr, "Usage: %s <file.pdf>\n", argv[0]);
        return -1;
    }

    pdf = pdfioFileOpen(argv[1], NULL, NULL, NULL, NULL);
    if(!pdf){
        fprintf(stderr, "Failed to open \'%s\'\n", argv[1]);
        return -1;
    }

    pdf_parse(pdf);
    pdfioFileClose(pdf);
    return 0;
}

sherrellbc · 2025-01-25T21:44:36Z

Ah, I worked it out. At first I thought token meant a whole line from the PDF stream. Then I thought it might just be a single character that clients of PDFIO would use to build a state machine around, whereby they would incrementally collect components of a token with each successive call to pdfioStreamGetToken.

Thanks to your debug output:

 <parse>        Pg.0 stream.0.open()
stream_read: No predictor.
get_char: Read 'q 0.1 0 0 0.1 0 0 cm\012/R8 gs\0120 0 0 1 k\012q\01210 0 0 10 0 0 cm BT\012/R26 12 Tf\0121 0 0 1 72 746.41 Tm\012[(\016)-35.1522(#)-37.1526(&)13.9923(\032)-41.1525(!)-341.84(\036)-68.1681($)-40.152(')-39(\))-72.1527(!)-341.84(\031)-72.1527(#)-37.1521( )-70
.1681(#)-37.1521(&)-263.84(')-39('
_pdfioTokenRead: state='K'
_pdfioTokenRead: Read 'q'.
_pdfioTokenFlush: Consuming 2 bytes.
_pdfioTokenFlush: Remainder '0.1 0 0 0.1 0 0 cm\012/R8 gs\0120 0 0 1 k\012q\01210 0 0 10 0 0 cm BT\012/R26 12 Tf\0121 0 0 1 72 746.41 Tm\012[(\016)-35.1522(#)-37.1526(&)13.9923(\032)-41.1525(!)-341.84(\036)-68.1681($)-40.152(')-39(\))-72.1527(!)-341.84(\031)-72.1527(#)-37
.1521( )-70.1681(#)-37.1521(&)-263.84(')-39('
 <parse>    q
get_char: Read '0.1 0 0 0.1 0 0 cm\012/R8 gs\0120 0 0 1 k\012q\01210 0 0 10 0 0 cm BT\012/R26 12 Tf\0121 0 0 1 72 746.41 Tm\012[(\016)-35.1522(#)-37.1526(&)13.9923(\032)-41.1525(!)-341.84(\036)-68.1681($)-40.152(')-39(\))-72.1527(!)-341.84(\031)-72.1527(#)-37.1521( )-70.1
681(#)-37.1521(&)-263.84(')-39(\036)'
_pdfioTokenRead: state='N'
_pdfioTokenRead: Read '0.1'.
_pdfioTokenFlush: Consuming 4 bytes.
_pdfioTokenFlush: Remainder '0 0 0.1 0 0 cm\012/R8 gs\0120 0 0 1 k\012q\01210 0 0 10 0 0 cm BT\012/R26 12 Tf\0121 0 0 1 72 746.41 Tm\012[(\016)-35.1522(#)-37.1526(&)13.9923(\032)-41.1525(!)-341.84(\036)-68.1681($)-40.152(')-39(\))-72.1527(!)-341.84(\031)-72.1527(#)-37.152
1( )-70.1681(#)-37.1521(&)-263.84(')-39(\036)'
 <parse>    0.1

I noticed that I was just printing the first character of the string token that was extracted. So a token is a component of a stream separated by whitespace?

Maybe I missed it, but it might be worth documenting what a token may be within the context of a real PDF example (like the Hello World PDF in the docs). I did read the brief in the documentation for pdfioStreamGetToken but it was still unclear to me that tokens would manifest like this.

sherrellbc · 2025-01-26T01:58:10Z

Sorry for polluting this ticket but this repo does not have a 'Discussion' section like some others on GitHub. Looking into parsing a bit further I noticed that PDFIO will strip the LineFeed character from the end of a stream command.

For instance, you can actually see it in the debug output I pasted above:

_pdfioTokenFlush: Remainder '0.1 0 0 0.1 0 0 cm\012/R8 gs\0120 0 0 1 k\012q\01210 0 0 10 0 0 cm BT\012/R26 12 Tf\0121 0 0 1 72 746.41 Tm\012[(\016)-35.1522(#)-37.1526(&)13.9923(\032)-41.1525(!)-341.84(\036)-68.1681($)-40.152(')-39(\))-72.1527(!)-341.84(\031)-72.1527(#)-37
.1521( )-70.1681(#)-37.1521(&)-263.84(')-39('

In particular: 0.1 0 0 0.1 0 0 cm\012. However, when the pdfioStreamGetToken function returns with this data you provide only cm, with the trailing \LF stripped off the end.

You can see the collection of this first sequence of commands.

 <parse>                q
 <parse>  71 00 00 00

 <parse>                0.1
 <parse>  30 2e 31 00

 <parse>                0
 <parse>  30 00 00 00

 <parse>                0
 <parse>  30 00 00 00

 <parse>                0.1
 <parse>  30 2e 31 00

 <parse>                0
 <parse>  30 00 00 00

 <parse>                0
 <parse>  30 00 00 00

 <parse>                cm
 <parse>  63 6d 00 00

Perhaps this is somewhat inconsequential. Or perhaps this is more so a postfix parsing question, but without some designation that the most recent input is an operator you are left only with the option of doing a comparison with all known operators for each token you receive, no?

This seems a little inefficient. Am I missing something?
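For what it is worth, comparing against every known operator does not have to be a linear scan. A minimal sketch, assuming a small hand-written table (the enum, the table, and `lookup_operator` are all hypothetical and cover only a handful of the operators seen in the debug output above), keeps the names sorted and uses the standard `bsearch`:

```c
#include <stdlib.h>
#include <string.h>

// Hypothetical operator identifiers; only a few operators are listed.
typedef enum
{
  OP_UNKNOWN,
  OP_BT, OP_ET, OP_Q_RESTORE, OP_TJ_ARRAY, OP_TF, OP_TJ, OP_TM,
  OP_CM, OP_GS, OP_K, OP_Q_SAVE
} op_id_t;

typedef struct
{
  const char *name;  // Operator as it appears in the content stream
  op_id_t    id;     // Identifier used by the rest of the parser
} op_entry_t;

// Must stay sorted in strcmp() order (uppercase sorts before lowercase).
static const op_entry_t op_table[] =
{
  { "BT", OP_BT },
  { "ET", OP_ET },
  { "Q",  OP_Q_RESTORE },
  { "TJ", OP_TJ_ARRAY },
  { "Tf", OP_TF },
  { "Tj", OP_TJ },
  { "Tm", OP_TM },
  { "cm", OP_CM },
  { "gs", OP_GS },
  { "k",  OP_K },
  { "q",  OP_Q_SAVE }
};

// bsearch() comparator: the key is a C string, the element is a table entry.
static int compare_ops(const void *key, const void *entry)
{
  return strcmp((const char *)key, ((const op_entry_t *)entry)->name);
}

// Return the identifier for "token", or OP_UNKNOWN if it is not in the table.
static op_id_t lookup_operator(const char *token)
{
  const op_entry_t *match = bsearch(token, op_table,
                                    sizeof(op_table) / sizeof(op_table[0]),
                                    sizeof(op_table[0]), compare_ops);

  return match ? match->id : OP_UNKNOWN;
}
```

Operands such as `0.1` or `/R8` simply fall through as `OP_UNKNOWN`, so the caller can push them onto an operand list until an operator arrives.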

michaelrsweet · 2025-01-26T13:43:18Z

@sherrellbc I'm pretty sure I enabled the GitHub discussions stuff for the PDFio project. However, since I'm always logged in and "own" the repository it could be I haven't done something needed to allow you to ask questions there?

WRT the pdfioStreamGetToken API, it is designed to return the next PDF processing token on the stream, which automatically skips whitespace and comments. I should probably add some more examples for this function along with documenting that name values start with '/'. But basically if the first character in the returned string is a letter then you have probably found a PDF operator ("false", "null", and "true" are the only exceptions since they are special values).
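The first-character rule described above can be written down as a small classifier. This is a sketch only: `token_kind_t` and `classify_token` are hypothetical names, and the function encodes just the two rules stated here (names start with '/', and a leading letter means an operator unless the token is "false", "null", or "true").

```c
#include <ctype.h>
#include <string.h>

// Hypothetical token categories for illustration.
typedef enum
{
  TOKEN_OPERATOR,  // Starts with a letter and is not a special value
  TOKEN_NAME,      // Starts with '/'
  TOKEN_VALUE,     // One of "false", "null", or "true"
  TOKEN_OTHER      // Numbers, strings, delimiters, and anything else
} token_kind_t;

// Classify a NUL-terminated token by looking at its first character.
static token_kind_t classify_token(const char *token)
{
  if (token[0] == '/')
    return TOKEN_NAME;

  if (isalpha((unsigned char)token[0]))
  {
    // The three keyword values are the only letter-initial non-operators
    if (!strcmp(token, "false") || !strcmp(token, "null") ||
        !strcmp(token, "true"))
      return TOKEN_VALUE;

    return TOKEN_OPERATOR;
  }

  return TOKEN_OTHER;
}
```

One caveat: the PDF text operators `'` and `"` do not start with a letter, so they land in `TOKEN_OTHER` here and would need separate handling in a real parser.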

WRT efficiency, PDFio isn't designed for rendering PDFs (this is stated pretty clearly at the front of the documentation) but rather for manipulating them. And while a PDF renderer/RIP would likely provide you with/use a more efficient representation of the page stream, it would first need to tokenize the input... Suffice it to say that PDF is far from the most efficient format to deal with and isn't ideal for extracting arbitrary information from...

michaelrsweet · 2025-01-26T13:48:09Z

Feel free to add any additional documentation comments to issue #95 that I created to track the pdfioStreamGetToken documentation along with other examples/docs that would be useful for extracting information from a PDF file.

sherrellbc · 2025-01-26T14:10:57Z

It looks like Discussions are available now. I did check, and yesterday it was not there. So thank you for this! If I have any further questions I will be sure to start something there.

RE my comment on efficiency: I was not referring to PDFIO specifically. It just seemed a bit strange to me that there was no clear delimitation between token types. Perhaps a decision point around the token starting with a letter could work well enough, with some caveats, as you mention, for other special characters like /.

michaelrsweet self-assigned this Jan 23, 2025

michaelrsweet added the bug and priority-low labels Jan 23, 2025

michaelrsweet added this to the Stable milestone Jan 23, 2025

michaelrsweet added a commit that referenced this issue Jan 23, 2025

Fix reading of compressed object streams (Issue #92)

9e2f3ab

michaelrsweet closed this as completed Jan 23, 2025

michaelrsweet reopened this Jan 24, 2025

michaelrsweet added a commit that referenced this issue Jan 24, 2025

Fix support for UTF-16 string values in dictionaries (Issue #92)

cca7383

Specifically to support Unicode Title and Author values.

michaelrsweet closed this as completed Jan 24, 2025

michaelrsweet mentioned this issue Jan 26, 2025

Add more documentation/examples for extracting content from PDFs #95

Open
