Fail to open large/complex PDF with NULL from pdfioFileOpen #92
Comments
OK, I added some missing error reporting - turns out there is a problem with a compressed object stream. Continuing to investigate... |
Fantastic response time. Thanks for looking into this AND fixing it. Looks like the pdfinfo test now runs, but it reports bad data for Title and Author:
Can you confirm if this information is just undefined in this PDF? |
Hm, I did run the unit tests and they all claim to pass. |
This PDF file (incorrectly) embeds Unicode text for some (but not all) of the metadata, such as the title and author. I can look at detecting this and converting the strings to UTF-8 (they start with a byte order mark, or BOM, so it is straightforward to detect). pdf2text.c is a very, very, very basic example of extracting plain text from the PDF page streams. It almost certainly won't work for complex PDF files (particularly if they use Unicode fonts) but has what you'd need to extract text from simple PDF reports or shipping labels. |
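For readers who want to handle such metadata values themselves, the BOM check described above can be sketched in plain C. This is only an illustration, not the fix that went into PDFio: `metadata_to_utf8` is a hypothetical helper name, and it assumes the strings are big-endian UTF-16 introduced by the bytes 0xFE 0xFF.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper (not part of PDFio): if `in` starts with the UTF-16BE
// byte order mark (0xFE 0xFF), decode it to UTF-8 in `out`; otherwise copy the
// bytes unchanged.  Returns the number of bytes written, excluding the NUL.
static size_t
metadata_to_utf8(const unsigned char *in, size_t inlen, char *out, size_t outsize)
{
  size_t i, n = 0;

  assert(in != NULL && out != NULL && outsize > 0);

  if (inlen < 2 || in[0] != 0xFE || in[1] != 0xFF)
  {
    // No BOM: treat the value as a plain byte string and copy what fits.
    if (inlen >= outsize)
      inlen = outsize - 1;

    memcpy(out, in, inlen);
    out[inlen] = '\0';
    return (inlen);
  }

  for (i = 2; i + 1 < inlen; i += 2)
  {
    unsigned long ch = ((unsigned long)in[i] << 8) | in[i + 1];

    // Combine a high/low surrogate pair into a single code point.
    if (ch >= 0xD800 && ch <= 0xDBFF && i + 3 < inlen)
    {
      unsigned long lo = ((unsigned long)in[i + 2] << 8) | in[i + 3];

      if (lo >= 0xDC00 && lo <= 0xDFFF)
      {
        ch = 0x10000 + ((ch - 0xD800) << 10) + (lo - 0xDC00);
        i += 2;
      }
    }

    // Unpaired surrogates become U+FFFD (replacement character).
    if (ch >= 0xD800 && ch <= 0xDFFF)
      ch = 0xFFFD;

    // Stop if the longest possible sequence (4 bytes) plus NUL will not fit.
    if (n + 4 >= outsize)
      break;

    if (ch < 0x80)
    {
      out[n++] = (char)ch;
    }
    else if (ch < 0x800)
    {
      out[n++] = (char)(0xC0 | (ch >> 6));
      out[n++] = (char)(0x80 | (ch & 0x3F));
    }
    else if (ch < 0x10000)
    {
      out[n++] = (char)(0xE0 | (ch >> 12));
      out[n++] = (char)(0x80 | ((ch >> 6) & 0x3F));
      out[n++] = (char)(0x80 | (ch & 0x3F));
    }
    else
    {
      out[n++] = (char)(0xF0 | (ch >> 18));
      out[n++] = (char)(0x80 | ((ch >> 12) & 0x3F));
      out[n++] = (char)(0x80 | ((ch >> 6) & 0x3F));
      out[n++] = (char)(0x80 | (ch & 0x3F));
    }
  }

  out[n] = '\0';
  return (n);
}
```

The helper takes an explicit length because UTF-16 data contains embedded zero bytes, so the raw value cannot be treated as a NUL-terminated C string before conversion.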
Fair enough. Thank you for your effort here. I suppose I better become acquainted with the PDF spec. |
Specifically to support Unicode Title and Author values.
This is what I get now:
|
Thanks for the suggestions. After looking through your API and briefly educating myself on the PDF spec, it seems like in order to do text extraction one needs to:
I don't see an indication that any transformations are applied to the payload content of streams, so I'd need to implement that too. Does this sound about right or am I missing any steps? I imagine this is roughly what Xpdf or Poppler do before providing the raw text content. |
Yes, you have it about right. Xpdf/Poppler also use the transforms/positioning to decide which text goes together, adding whitespace, etc. |
I overlooked the fact that
And if you pass
I may well just be demonstrating my ignorance here, but I'm failing to find anything relevant in the PDF specification that would explain this decoded stream. I have attached the
|
Ah, I worked it out. At first I thought token meant a whole line from the PDF stream. Then I thought it might just be a single character that clients of PDFIO would use to build a state machine around, whereby they would incrementally collect components of with each successive call to Thanks to your debug
I noticed that I was just printing the first character of the string token that was extracted. So a token is a component of a stream separated by whitespace? Maybe I missed it - but it might be worth documenting what a token may be within the context of a real PDF example (like the Hello World PDF in the docs). I did read the brief in the documentation for |
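To make the notion of a token concrete, here is a small standalone sketch that splits a Hello World style content stream into tokens. It is not PDFio's implementation: `next_token` is a hypothetical function, hex strings and dictionaries are not handled, and keeping the leading `(` on string tokens while dropping the closing `)` is an assumption chosen to match the behavior described above.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

// Hypothetical, simplified tokenizer (not PDFio's implementation).  Reads one
// token from *sp into buf and advances *sp.  Returns 1 on success, 0 at end.
static int
next_token(const char **sp, char *buf, size_t bufsize)
{
  const char *s = *sp;
  size_t n = 0;

  assert(sp != NULL && buf != NULL && bufsize >= 2);

  // Skip whitespace between tokens, including line feeds after operators.
  while (*s && isspace((unsigned char)*s))
    s++;

  if (!*s)
  {
    *sp = s;
    return (0);
  }

  if (*s == '(')
  {
    // Literal string: keep '(' as a type marker, drop the closing ')'.
    int depth = 1;

    buf[n++] = *s++;

    while (*s && depth > 0)
    {
      if (*s == '\\' && s[1])
      {
        // Copy escape sequences verbatim (they are not decoded here).
        if (n + 1 < bufsize)
          buf[n++] = *s;
        s++;
      }
      else if (*s == '(')
      {
        depth++;
      }
      else if (*s == ')' && --depth == 0)
      {
        s++;
        break;
      }

      if (n + 1 < bufsize)
        buf[n++] = *s;
      s++;
    }
  }
  else if (*s == '[' || *s == ']')
  {
    // Array delimiters are single-character tokens.
    buf[n++] = *s++;
  }
  else
  {
    // Names (/F1), numbers (12), and operators (Tf) run up to the next
    // whitespace or delimiter character.
    do
    {
      if (n + 1 < bufsize)
        buf[n++] = *s;
      s++;
    }
    while (*s && !isspace((unsigned char)*s) && !strchr("()[]/<>", *s));
  }

  buf[n] = '\0';
  *sp = s;
  return (1);
}
```

Calling it repeatedly on `BT /F1 12 Tf (Hello) Tj ET` yields `BT`, `/F1`, `12`, `Tf`, `(Hello`, `Tj`, and `ET`: each name, number, string, and operator is one token, and the whitespace between them is discarded.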
Sorry for polluting this ticket but this repo does not have a 'Discussion' section like some others on GitHub. Looking into parsing a bit further I noticed that PDFIO will strip the line feed character from the end of a stream command. For instance, you can actually see it in the debug output I pasted above:
In particular: You can see the collection of this first sequence of commands.
Perhaps this is somewhat inconsequential. Or perhaps this is more of a postfix parsing question - but without some designation that the most recent input is an operator, you are left only with the option of comparing each token you receive against all known operators, no? This seems a little inefficient. Am I missing something? |
@sherrellbc I'm pretty sure I enabled the GitHub discussions stuff for the PDFio project. However, since I'm always logged in and "own" the repository it could be I haven't done something needed to allow you to ask questions there? WRT the WRT efficiency, PDFio isn't designed for rendering PDFs (this is stated pretty clearly at the front of the documentation) but rather for manipulating them. And while a PDF renderer/RIP would likely provide you with/use a more efficient representation of the page stream, it would first need to tokenize the input... Suffice it to say that PDF is far from the most efficient format to deal with and isn't ideal for extracting arbitrary information from... |
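On the question of telling operators from operands in a postfix stream, one possible approach is sketched below. It is an illustration only, not something PDFio provides: `classify_token` and the `token_kind_t` type are hypothetical, the operator table is a small subset, and it assumes string tokens keep their leading `(`. Most operands can be recognized from their first character, so only keyword-like tokens need a lookup, and a sorted table makes that lookup a binary search.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef enum { TOKEN_OPERAND, TOKEN_OPERATOR, TOKEN_UNKNOWN } token_kind_t;

// Illustrative subset of content stream operators, sorted in strcmp() order
// (uppercase sorts before lowercase) so that bsearch() can be used.
static const char * const operators[] =
{
  "BT", "ET", "Q", "TJ", "Td", "Tf", "Tj", "Tm", "cm", "q"
};

// bsearch() comparator: key is the token, elem points at a table entry.
static int
compare_operator(const void *key, const void *elem)
{
  return (strcmp((const char *)key, *(const char * const *)elem));
}

// Hypothetical classifier (not part of PDFio).
static token_kind_t
classify_token(const char *token)
{
  assert(token != NULL);

  // Numbers, names, strings, arrays, and hex strings/dictionaries announce
  // themselves with their first character, so no table lookup is needed.
  if (*token && strchr("0123456789+-./(<[]>", *token))
    return (TOKEN_OPERAND);

  // The keyword operands look like operators but are values.
  if (!strcmp(token, "true") || !strcmp(token, "false") || !strcmp(token, "null"))
    return (TOKEN_OPERAND);

  // Everything else is looked up in the sorted operator table.
  if (bsearch(token, operators, sizeof(operators) / sizeof(operators[0]),
              sizeof(operators[0]), compare_operator))
    return (TOKEN_OPERATOR);

  return (TOKEN_UNKNOWN);
}
```

A caller would push operand tokens onto a stack and, on seeing an operator, pop the operands that operator expects. A `switch` on the first one or two characters or a perfect hash would serve equally well in place of the binary search.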
Feel free to add any additional documentation comments to issue #95 that I created to track the pdfioStreamGetToken documentation along with other examples/docs that would be useful for extracting information from a PDF file. |
Describe the bug
Fails to open specific PDF
To Reproduce
Steps to reproduce the behavior:
pdfioinfo.c example with PDF from (1)
pdfioFileOpen fails with NULL
Expected behavior
PDF opens and reference given.
System Information:
Additional context
PDFIO version v1.4.0, but latest on master also fails.
I enabled debug but the output is quite dense. The final reported debug was: