-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Semantics of null array elements in logical structure objects that can be arrays #308
Comments
@bdoubrov @DuffJohnson @mrbhardy - please weigh in. Happy if you wish to take discussion to another TWG first... |
I would say that To me the presence of
|
Re: StructTreeRoot/ParentTree . . .
This would presumably indicate that there is a MCS content item that has no parent element (whose reference must be at that index in the array)! Such orphaned content items (with an MCID) seem unlikely to actually exist, so what meaning does this "popular implementation" give to such use of 'null'? : It would appear at least one popular implementation uses 'null`. |
Re StructTreeRoot/K : Table 354 clearly states that this either a single dictionary or an array of dictionaries. But the null object is definitely not a dictionary, so how can this value be legal here? |
So where an element in the array is So consider these situations:
Which ones do you think are valid? |
StructTreeRoot/ParentTree is more awkward to do as an example as it is a number tree. From Table 354 "the value shall be an array of references to the parent elements of those marked-content sequences" - can this array have |
Using |
I'm referring to the value of the name-/number-tree that is an array and that array then having But I would be nervous about making global assumptions that every name-tree or number-tree in PDF can always accept |
You are right, I meant
I agree this is different from having |
Yes indeed, so what can the null values possibly mean here? : [23 0 R null null 24 0 R 24 0 R] But note that it is not possible to remove values from these arrays as they are indexed by the fixed MCIDs of actual content items. But since each such content item has a parent structure, the use of null here must surely be an error. |
In this particular example the use case was to merge two structure elements into one. So, in the structure tree two elements were deleted and one added. The two |
It seems that is just wrong. These entries are not indexed by structure elements but by content items with MCIDs. Merging a structure element does not remove its content items from the stream, they are still there and have MCIDs. The structure element that is now the parent will need to be changed, but surely every content item with an MCID will still have a parent structure item. If content items with MCIDs can exist that do not have any parent structure item, then how to represent this case should be explicitly documented in 14.7.5.4. Currently it states only this: "For a content stream containing marked-content sequences that are content items, the value shall be an array of indirect references to the sequences’ parent structure elements. |
I just ran a test over about 100,000 PDFs and a LOT of PDFs have this idiom with I did not check the content streams and whether any MCID still indexed those |
I think I can see one (more?) good reason for there being many nulls in these arrays. This cause highlights a more general failing with the documentation of how these arrays, indexed by MCIDs, should work. Some explanation of how these nulls arise: The MCID numbers of the content items in a content stream can be arbitrary integers, they need not form a 'complete integer interval' such as {0,1,2,3,4,5}; they can be any set, such as {0,2,4}, or even (in theory) {0,2,4,652}. But such a set of MCID numbers themselves must be used as indices into these arrays, thus in the second example above, the array would need to be of length 5, and in the final example 653. This is somewhat strange but it is the reality to be dealt with. But it is not covered in 14.7.5.4 or, I believe, anywhere else. Sets of MCIDs such as {0,2,4} arise vey naturally and there is no prescription for what entry to use in the "missing indices", 1 and 3 in this case. Thus "null" is used as a pragmatic solution that seems to work. This pragmatic solution needs to be documented (as does the one below). The following, distinct, possibility was previously indicated by @bdoubrov: When an existing MCID number is that of a content item that has no structural parent. It is not clear whether this should also be indicated by using a null in this array; it maybe needs a different value, such as using the root node of the structure tree. |
There is an existing discussion in subclause 14.7.5.4 "Finding structure elements from content items" regarding MCIDs and the ParentTree, but it does not cover the cases we're discussing here. Interestingly searching for MCID does not find subclause 14.7.5.4 so this info is difficult to find... Trying to summarize:
These are all file format requirements, not processor requirements. Did I miss anything? |
I have not checked all the details here Here is one extra possible refinement to consider, re: "* subclause 14.7.5.4 "Finding structure elements from content items" needs to state that null array entries shall be used for marked-content identifiers that do not have any structural parent" It might be better to recommend using something other than a "null" value for such an MCID index in the array, so as to clearly distinguish this case, when the index is the MCID of a content item, from the case of a "non-existent" MCID (the previous bullet point). One possibility is to use for this case the structure tree root. This makes some sense since, even if there are no surviving structure elements in the tree, the structure tree root must survive in order for the "parents" number tree to be defined, without which these arrays cannot not even be accessed. A useful result of this change will be a better outcome from the following events: |
That would imho not be allowed. If a MC points to a structure element as its parent, this structure element should also contain this MC as a kid, but the root can't contain content items as kid. |
A good point. So, is there a better way to make this important distinction between existing and non-existent MCIDs? |
Although, of course, the "structure tree root" is not in fact a "structure element", so this stricture (that it must have the MC item as a kid) does not apply to it. The idea is that this is simply better than using "null" as the "distinguished value" for both cases. Like "null" it is really just a name, not an actual "structure parent". |
StructTreeRoot is explicitly mentioned in the matrix with a list of allowed children, and content items are not one of them: But even if that were allowed: How would you then distinguish between "wanted" MCID put there by purpose and "unwanted" MCID which only got pushed out of other structures, e.g. because they are empty and should be ignored? Actually the main use of the null entry for us, is to hide empty MCID the code produces and it would be a bit contra productive to make them visible again by moving them into the root. |
But no one is adding any children to the root, it is merely used as an arbitrary marker, that is not "null". If no one likes it then choose something else, such as "the integer 0". |
I will do so... but it will take some time. |
It would also be good to get Adobe's input on this thread... @mrbhardy? |
@stechio wrote: "it's beyond my imagination that a processor could be forced by its users to torture tiny bunches of innocent marked-content sequences . . . " Maybe you ire should in fact be directed at the use of an array here: this could be a hangover from many decades ago when this was (intended to be) a document-wide list and hence too long for the use of any other data structure? Since it is now only page-wide, there would be no real-world problems with using a dictionary here: this would contain keys only for the actually “current content items” that are also MCSs on this page. Surely such a dictionary will never get too large for any processor limits? — although I can foresee scenarios that will produce potential numbers of such content items per page in the 100s (but only quite low 100s). Another possibility is to still allow also the use of an array, but recommend that it is used for a page only when the size of the dictionary gets too large. Such alternatives would preserve compatibility whilst making life much easier for future developers. |
I can't grasp what you are referring to... The structure element lookup mechanism has been the same since the very inception of logical structure in PDF 1.3 (see 8.4.3 (Logical Structure)) — parent structure elements for a content stream were already indexed in page-scoped arrays; MCID was already defined as "an integer marked-content identifier that uniquely identifies the marked-content sequence within its content stream", it even recommended: "Because marked-content identifiers serve as indices into an array in the structural parent tree, their assigned values should be as small as possible to conserve space in the array". Did any precursor to PDF 1.3 logical structure ever exist? My rage was about a blatantly abusive implementation: no matter how annoying a constraint may feel, a diligent implementer should always have the consequences of its decisions in mind! Paraphrasing a popular saying (uncertain attribution): always code as if the guys who end up maintaining the PDFs generated by your processor will be violent psychopaths who know where you live 😁 . (As I previously noted, the problem here was that PDF 1.3 spec didn't explicitly enforce a formal processor requirement for MCID assignment, leaving a degree of leeway to lazy/cynical interpretations) To my understanding, this array structure is the most suitable for the purpose (indexed lookups), provided that its implementers judiciously handle it (ie, canonical (zero-based, contiguous) MCIDs only).
My feeling is that trading indexes (efficiency) for keys (flexibility) would be a lazy game without significant benefits and with certain drawbacks. In real-world applications, how would a dictionary of structural parents behave over time (editing session after editing session) compared to a (canonically generated and managed) array? Sure, the array could fragment with discarded MCIDs, or possibly expand when all discarded MCIDs are reused (as enforced by my algorithm proposal); the dictionary could shrink and expand, but its memory footprint would be often larger both in serialized and deserialized forms than its counterpart. Dictionaries would be somewhat more comfortable to juggle, but where is the solid gain of them as structural parents instead of current arrays? IMO, only a benchmark over a statistically-significant collection of real-world PDFs could clarify this point. |
I updated my analysis over the null-ridden PDF file with the overall count of |
To add to @stechio's point, we generally try to avoid unconstrained dictionaries in PDF (even though in reality we do have a few, such as the role map dictionary). Historically, this was to reserve the key names as "atoms" in a popular implementation, but since dictionaries have no order, walking them to find an entry would be inefficient. Arrays seem like the right structure. I agree with the outcome above, which is that only StructParents can have nulls and I don't think we can kill this off. We also can't really mandate processor behaviors in generating these, since we're trying to just tell them how to build syntax. A note added to the section on StructParents explaining that they can have null entries, but that it has negative impacts if used broadly (e.g. global MCID counters as mentioned previously) would be the right way forward. |
Agreed that one would not want, in general, to use large dictionaries (for such purposes) unless they are implemented as hashed arrays so that look-up remains efficient. |
I ran a test over on 442 random PDFs. For that mini-corpus, the record was 67,524 There were a few other PDFs with over 60,000 I could run more but I don't think it's necessary. |
@petervwyatt as I said previously, I think this is legal if a bad choice and the most we should do is add a note saying this is a bad idea. |
Would you care to draft something? :-) |
So circling back around to my original question of where is
|
When you say "random", what kind of collection method did you apply to gather those PDFs? Were they automatically harvested on the 'net with some randomized crawling strategy, or picked from existing collections, or...? (just to have an idea of their statistical significance)
Thanks for sharing, @petervwyatt; here it is the analysis of cgc-principles-and-recommendations-fourth-edn.pdf (in the meantime, I refined my inspection code further, with additional information (see column descriptions in the footnote here below)) — as you can see, it shows the same denormalization anti-pattern as I previously encountered:
Nested content streamsAs you may have noticed, the last page paints several form XObjects whose structural marked-content sequences are mapped in their own parent tree arrays. The following screenshot shows one of such occurrences (marked-content sequence with MCID=33, corresponding to the vectorial "FSC" logo, here highlighted in magenta): And this is the associated structure element ( Overall
|
A true random sample across millions of PDFs including CC_MAIN..., "Issue Tracker", GovDocs1, my own collections, PDF Association test suites, etc (many listed in https://github.com/pdf-association/pdf-corpora). I stopped sampling once it had accumulated 1GB of files - in this case that was at 442 PDFs. |
Comparative graph added to last |
as an aside,
it looks like that code uses an updated version of your PDF library. Is that version available somewhere? |
Hi @mkl-public, I saw your comment last year on the project's website, but had no heart to tell you that, after your previous request on stackoverflow, I still had no idea about its release 😢 — because of your passionate interest in PDF technology, I thought to offset my unforgivable delay by offering you a sneak peek of some of the ongoing work, but it's still too early... I am obsessed with details, never satisfied, always pushing code to a better balance between simplicity and power... The new version is already packed with several new usable (and useful) features developed over the last three years (and I have even more ideas yet to implement, BUT for another dev cycle, of course 😅 ); I'm currently completing the implementation of logical structure and tagged PDF, then the current dev cycle will be closed 🥳! I will publish a pre-release here on github, waiting for feedback. Stay assured, I didn't forget you! 😃 thank you |
Both sample files analyzed in this thread (this for Case 1 and this for Case 2) were apparently produced by Adobe PDF Library (version 15.0), so in these cases the global MCID counter cannot be a whimsical choice dictated by laziness. Trying to figure out a rationale behind that, page 39 of Case 2 seems to give a possible clue: marked-content sequences nested in form XObjects (see EXAMPLE 5 in subclause 14.7.5.2 (Marked-content sequences as content items)). Since form XObjects are inherently reusable, one might think that the MCIDs assigned to them couldn't be assigned elsewhere in the same file because of possible collisions. But subclause 14.7.5.4 (Finding structure elements from content items) clearly states that each content stream (page object or other kind) has its own set of zero-based MCIDs (emphasis is mine):
So: an MCID is a ZERO-based index into an array of references to parent structure elements, ONE FOR EACH marked-content sequence contained WITHIN THAT content stream. Because of such definition, no collision is possible, isn't it? In turn, the only plausible reason for global MCID counter vanishes miserably... (that wouldn't be decisive, anyway, since the overwhelming majority of As Adobe PDF Library is, de facto, the gold standard among the implementations of ISO PDF, it would be great if @mrbhardy could let us reach out to its dev team to understand the rationale of their solution. |
@stechio the PDF Library doesn't stop people doing very low-level control or setting things up incorrectly like this. It power most of our Desktop products and they generally produce good Tagged PDF, so I don't think this is a problem in the core library. It is possible that a product using it is setting it up in a way that produces this unfortunate outcome. I'm happy to draft a note saying that it is important to start MCIDs (start just meaning the lowest value, not first) at zero on each page and within XObjects. |
@mrbhardy I understand your point (also my initial analysis supposed it could be the result of a custom arrangement in the generation of the logical structure) and I am certain that the PDF Library delivers top-quality documents per-se; yet, IMHO, it could be useful to report this possible issue, just in case. |
PDF TWG agree:
|
very minor nit:
I think this is a good change, but "highly recommended" reads a little too normative to me (and turning this into a normative recommendation is probably too heavy-handed). Maybe we can write the note more matter-of-factly as follows:
|
Related to Errata #157 and Arlington Issue #90, but specifically for logical structure. This is effectively checking the rules for any objects that can be an array (either directly or as a value in a name or number-tree) - can the following array objects in logical structure contain
null
objects as array elements?StructTreeRoot/K array - can this have
null
array elements?StructTreeRoot/ParentTree - specifically when the value in the number-tree is an array "For a page object or content stream containing marked-content sequences that are content items, the value shall be an array of references to the parent elements of those marked-content sequences." - can an element in that array be
null
? It would appear at least one popular implementation uses 'null`...StructElem/K - noting that the array format can be an array of various kinds of dicts, or an array of integers
StructElem/A - noting also that this array can have an optional revision number after a dict/stream entry so there is potential ambiguity as to whether a
null
is a dict/stream or revision in the array elementsStructElem/C - noting also that this array can have an optional revision number after a name so there is potential ambiguity as to whether a
null
is a name or revision in the array elementsStructElem/Ref
The text was updated successfully, but these errors were encountered: